(57) Abstract: A method of generating a search result list also provides related searches for use by a searcher. Search listings which
generate a match with a search request submitted by the searcher are identified in a pay-for-performance database which includes
a plurality of search listings. Related search listings contained in a related search database generated from the pay-for-performance
database are identified as relevant to the search request. A search result list is returned to the searcher including the identified search
listings and one or more identified related search listings.

METHOD AND APPARATUS FOR IDENTIFYING RELATED
SEARCHES IN A DATABASE SEARCH SYSTEM 

APPENDIX / COPYRIGHT REFERENCE 



A portion of the disclosure of this patent document contains material which
is subject to copyright protection. The copyright owner has no objection to the
facsimile reproduction by anyone of the patent document or the patent disclosure,
as it appears in the U.S. Patent and Trademark Office patent file or records, but
otherwise reserves all copyright rights whatsoever.

An Appendix of computer program source code is included herewith. The
Appendix is hereby expressly incorporated herein by reference, and contains
material which is subject to copyright protection as set forth above.

BACKGROUND OF THE INVENTION 

The present invention relates generally to a method and system for 
generating a search result list, for example, using an Internet-based search engine. 
More particularly, the present invention relates to a method and system for 
generating search results from a pay for performance database and generating a list 
of related searches from a related search database.

Search engines are commonly used to search the information available on 
computer networks such as the World Wide Web to enable users to locate 
information of interest that is stored within the network. To use a search engine, a 
user or searcher typically enters one or more search terms that the search engine uses 
to generate a listing of information, such as web pages, that the searcher is then able 
to access and utilize. The information resulting from the search is commonly 
identified as a result of an association that is established between the information and 
one or more of the search terms entered by the user. Different search engines use 
different techniques to associate information with search terms and to identify related 
information. These search engines also use different techniques to provide the 
identified information to the user. Accordingly, the likelihood of information being 
found as a result of a search varies depending upon the search engine used to perform 
the search.

This uncertainty is of particular concern to web page operators that make
information available on the World Wide Web. In this setting, there are often several
web page operators or advertisers that are competing for the same group of potential
viewers or customers. Accordingly, a web page's ability to be identified as the result
of a search is often important to the success of a web page. Therefore, web page
operators often seek to increase the likelihood that their web page will be seen as the
result of a search.

One type of search engine that provides web page operators with a more
predictable method of being seen as the result of a search is a "pay for performance"
arrangement where web pages are displayed based at least in part upon a monetary
sum that the advertiser or web page operator has agreed to pay to the search engine
operator. The web page operator agrees to pay an amount of money, commonly
referred to as the bid amount, in exchange for a particular position in a set of search
results that is generated in response to a user's input of a search term. A higher bid
amount will result in a more prominent placement in a set of search results. Thus, a
web page operator may attempt to place high bids on one or more search terms to
increase the likelihood that their web page will be seen as a result of a search for that
term. However, there are many similar search terms, and it is difficult for a web
page operator to bid on every potentially relevant search term. Likewise, it is
unlikely that a bid will be made on every search term. Accordingly, a search engine
operator may not receive any revenue from searches performed using certain search
terms for which there are no bids.

In addition, because the number of existing web pages is ever increasing, it is
becoming more difficult for a user to find relevant search results. The difficulty of
obtaining relevant search results is further increased because of the search engine's
dependence on the search terms entered by the user. The search results that a user
receives are directly dependent upon the search terms that the user enters. The entry
of one search term may not result in relevant search results, while the entry of only
a slightly different search term can result in relevant search results. Accordingly, the
selection of search terms is often an important part of the search process. It would be
of benefit to both the searcher and the advertisers to recommend related searches for
the searcher to provide to the search engine. However, current search engines do not
enable a search engine operator to provide related search terms, such as those that
will produce relevant search results, to a user. A system that overcomes these
deficiencies is needed.

SUMMARY

By way of introduction only, in accordance with one embodiment of the
invention, a search request is received from a searcher and used to perform a
search on a pay for performance database. In the pay for performance database,
there are stored search listings including web page locators and bid amounts to be
paid by the operator of the listed web page. The search using the pay for
performance database produces search results which are presented to the searcher.
The search request is also used to perform a search on a related search database.
The related search database has been formed at least in part using contents of the
pay for performance database. The search of the related search database produces
a list of related searches which are presented to the searcher.

In accordance with a second embodiment, a related search database is
created using a pay for performance database. All text from all web pages
referenced by the pay for performance database is stored and used to create an
inverted index. Additional indexes are used to improve the relevancy and spread
of related search results obtained using the database.

The foregoing discussion of illustrative embodiments of the invention has
been provided only by way of introduction. Nothing in this section should be
taken as a limitation on the following claims, which define the scope of the
invention.

BRIEF DESCRIPTION OF SEVERAL VIEWS OF THE DRAWINGS

FIG. 1 is a block diagram illustrating a database search system in 
conjunction with a computer network; 

FIG. 2 is a flow diagram illustrating a method for operating the database
search system of FIG. 1;

FIG. 3 is a flow diagram illustrating a method for operating the database
search system of FIG. 1;

FIG. 4 is a flow diagram illustrating in more detail a portion of the method
shown in FIG. 2;

FIG. 5 is a flow diagram illustrating in more detail a portion of the method
shown in FIG. 4;

FIG. 6 is a flow diagram illustrating a method for forming a related
searches database; and

FIG. 7 is a flow diagram illustrating a method for removing similar page
information from a database.

DETAILED DESCRIPTION OF THE PRESENTLY PREFERRED 
EMBODIMENTS 

Referring now to the drawing, FIG. 1 is a block diagram of a database 
search system 100 shown in conjunction with a computer network 102. 

The database search system 100 includes a pay for performance database 
104, a related searches database 106, a search engine web server 108, a related 
searches web server 110 and a search engine web page 114. The servers 104, 106, 
108 may be accessed over the network 102 by an advertiser web server 120 or a 
client computer 122.

The network 102 in the illustrated embodiment is the Internet and provides
data communication according to appropriate standards such as Internet Protocol.
In other embodiments, other network systems may be used alone or in conjunction
[...] Internet Protocol or similar data communication standard. Other data
communications standards may be used as well to ensure reliable communication
of data.

The database search system 100 is configured as part of a client and server
architecture. In the context of a computer network such as the Internet, a client is
a process such as a program, task or application that requests a service which is
provided by another process, known as a server program. The
client process uses the requested service without having to know any working
details about the other server program or the server itself. In networked systems, a
client process usually runs on a computer that accesses shared network resources
provided by another computer running a corresponding server process. A server is
typically a remote computer system that is accessible over a communications
medium such as a network. The server acts as an information provider for a
computer network. Thus, the system 100 operates as a server for access by the
clients such as client computer 122 and the advertiser web server 120.

The client computers 122 can be conventional personal computers,
workstations or computer systems of any size. Each client computer 122 typically
includes one or more processors, memory, input and output devices and a network
interface such as a modem. The advertiser web server 120, the search engine web
server 108, the related searches web server 110 and the account management web
server 112 can be similarly configured. However, the advertiser web server 120,
the search engine web server 108, the related searches web server 110 and the
account management web server 112 may each include many computers
connected by a separate private network.

The client computer 122 executes a World Wide Web ("web") browser
program 124. Examples of such a program are Navigator, available from
Netscape Communications Corporation, and Internet Explorer, available from
Microsoft Corporation. The browser program 124 is used by a user to enter
addresses of specific web pages to be retrieved. These addresses are referred to as
Uniform Resource Locators (URLs). In addition, once a page has been retrieved,
the browser program 124 can provide access to other pages or records when the
user "clicks" on hyperlinks to other web pages. Such
hyperlinks provide an automated way for the user to enter the URL of another
page and to retrieve that page. The pages can be data records including as content
plain textual information or more complex digitally encoded multimedia content
such as software programs, graphics, audio data, video data and so forth.

Client computers 122 communicate through the network 102 with various
network information providers. These information providers include the
advertiser web server 120, the account management server 112, the search engine
server 108, and the related searches web server 110. Preferably, communication
functionality is provided by HyperText Transfer Protocol (HTTP), although other
communication protocols such as FTP, SNMP, TELNET and a number of other
protocols known in the art may be used. Preferably, search engine server 108,
related searches server 110 and account management server 112, along with
advertiser servers 120, are located on the World Wide Web. U.S. Patent Application
Number 09/322,627, filed May 29, 1999 and entitled "System and Method for
Influencing a Position on a Search Result List Generated by a Computer Network
Search Engine," and U.S. Patent Application No. 09/494,818, filed January 31,
2000 and entitled "Method and System for Generating a Set of Search Terms," are
commonly assigned to the assignee of the present application and are incorporated
herein by reference. These applications disclose additional aspects of search
engine systems.

The account management web server 112 in the illustrated embodiment
includes a computer storage medium such as a disc system and a processing
system. A database is stored on the storage medium and contains advertiser
account information. Conventional browser programs 124, running on client
computers 122, may be used to access advertiser account information stored on the
account management server 112.

The search engine web server 108 permits network users, upon navigating
to the search engine web server URL or sites on other web servers capable of
submitting queries to the search engine web server 108 through a browser program
124, to type keyword queries to identify pages of interest among the millions of
pages available on the web. In one embodiment of the present invention, the
search engine web server 108 generates a search result list that includes, at least in
part, relevant entries obtained from and formatted by the results of the bidding
process conducted by the account management server 112. The search engine web
server 108 generates a list of HyperText links to documents that contain
information relevant to search terms entered by the user at a client computer 122.
The search engine web server transmits this list, in the form of a web page 114, to
the network user, where it is displayed on the browser 124 running on the client
computer 122. One embodiment of the search engine web server may be found by
navigating to the web page at URL http://www.goto.com/.

Search engine web server 108 is connected to the network 102. In one
embodiment of the present invention, search engine web server 108 includes a pay
for performance database including a plurality of search listings. The database
104 contains an ordered collection of search listing records used to generate
search results in response to user queries. Each search listing record contains the
URL of an associated web page or document, a title, descriptive text and a bid
amount. In addition, search engine web server 108 may also be connected to the
account management server 112. The account management server 112 may also
be connected to the network 102.

In addition, in the illustrated embodiment of FIG. 1, the database system
100 further includes a related searches web server 110 and an associated related
searches database 106. The related searches web server 110 and database 106
operate to provide suggested, related searches for presentation to a searcher along
with search results in response to his query. Users conducting searches for
information using a search engine web server such as the server 108 often perform
searches which are inappropriately focused as compared to the index data of the
web site search engine. Users may use search terms which are either too vague and
generalized, such as "music," or too specific and focused, such as "hot jazz from
[...] their query to better obtain useful information from the search engine. The related
searches web server 110 provides the user with query suggestions better suited to
the abilities of the pay for performance database 104.

In the illustrated embodiment, the pay for performance database 104 is
established in conjunction with advertisers who operate web servers such as
advertiser web server 120. Advertiser web pages 121 are displayed on the
advertiser web server 120. An advertiser or web site promoter may, through an
account residing on the account management server 112, participate in a
competitive bidding process with other advertisers. An advertiser may bid on any
number of search terms relevant to the content of the advertiser's web site.

The bids submitted by the web site promoters are used to control
presentation of search results to a searcher using client computer 122. Higher bids
receive a more advantageous placement on a search result list generated by the
search engine web server 108 when a search using the search term bid on by the
advertiser is executed. In one embodiment, the amount bid by an advertiser
comprises a money amount that is deducted from the account of the advertiser
each time the advertiser web site is accessed via a hyperlink on the search result
list page. A searcher clicks on the hyperlink with a computer input device such as
a mouse to initiate a retrieval request to retrieve the information associated with
the advertiser's hyperlink. Preferably, each access or click on a search result list
hyperlink is redirected to the search engine web server 108 to associate the click
with the account identifier for an advertiser. This redirection action, which is not
apparent to the searcher, will access account information coded into the search
result page before accessing the advertiser's URL using the search result list
hyperlink clicked on by the searcher. In the illustrated embodiment, the
advertiser's web site description and hyperlink on the search result list page is
accompanied by an indication that the advertiser's listing is a paid listing. Each
paid listing displays an amount corresponding to a price per click paid by the
advertiser for each referral to the advertiser site through this search result list.

The searcher may click on HyperText links associated with each listing in
that search result page to access the corresponding web pages. The HyperText
links may access web pages anywhere on the Internet, and include paid listings to
advertiser web pages 121 located on the advertiser web server 120. In one
embodiment of the present invention, the search result list also includes non-paid
listings that are not priced as a result of advertiser's bids and are generated by a
conventional search engine, such as the Inktomi, Lycos, or Yahoo! search engines.
The non-paid HyperText links may also include links manually indexed into the
pay for performance database 104 by an editorial team. Preferably, the non-paid
listings follow the paid advertiser listings on the search results page.

Related searches web server 110 receives the search request from the
searcher at client computer 122 as entered using the search engine web page 114.
In the related searches database 106, which includes related search listings
generated from the pay for performance database 104, the related searches web
server 110 identifies related search listings relevant to the search request. In
conjunction with the search engine web server 108, the related searches web
server 110 returns a search result list to the searcher including the identified search
listings located in the pay for performance database and one or more identified
related search listings located in the related searches database 106. Operation of
the related searches web server 110 in conjunction with the related searches
database 106 will be described below in conjunction with FIGS. 2-5. The
formation of the related searches database 106 will be described below in
conjunction with FIG. 6.

FIG. 2 is a flow diagram illustrating a method for operating the database
search system 100 of FIG. 1. The method begins at block 200. Source code
for implementing the method of FIG. 2 and other method steps described herein is
included as an appendix.

At block 202, a search request is received. The search request may be
received in any suitable manner. It is envisioned that a search request will
originate with a searcher using a client computer to access the search engine web
page of the database system implementing the method illustrated in FIG. 2. A
search request may be typed in as input text or made by a hyperlink click to initiate the
search request and search process.

After block 202, two parallel processes are initiated. At block 204, the
search engine web server of the database search system identifies matching search
listings in the pay for performance database of the system. In addition, the search
engine web server may further identify non-paid search listings.

WO 01/90947

PCT/US01/16161

Similarly, at block 206, a related searches web server initiates a search to
identify matching related search listings in the related search database. By
matching search listings, it is meant that the respective search engine identifies
search listings contained in the respective database which generate a match with
the search request. A match may be generated if an exact, letter for letter textual
match occurs between a bid on keyword and a search term. In other embodiments,
a match may be generated if a bidded keyword has a predetermined relationship
with a search term. For example, the predetermined relationship may include
matching the root of a word which has been stripped of suffixes; in a multiple
word query, matching several but not all of the words; or locating the multiple
words of the query with a predetermined number of words of proximity.
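
The matching relationships described above can be sketched as follows. This is a minimal illustration, not the code of the Appendix: the function names, the suffix list and the `min_shared` threshold are all assumptions introduced here, the "exact" case compares text after lower-casing, and the proximity relationship is left out for brevity.

```python
import re

# Illustrative suffix list (an assumption, not taken from the source).
SUFFIXES = ("ing", "ers", "er", "es", "ed", "s")


def stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_type(keyword, query, min_shared=2):
    """Classify how a bidded keyword relates to a query.

    Returns "exact", "stem", "partial" or None. `min_shared` is a
    hypothetical threshold for "several but not all of the words".
    """
    kw, q = tokenize(keyword), tokenize(query)
    if kw == q:
        return "exact"
    kw_stems = [stem(w) for w in kw]
    q_stems = [stem(w) for w in q]
    if kw_stems == q_stems:
        return "stem"
    shared = set(kw_stems) & set(q_stems)
    if len(q_stems) > 1 and len(shared) >= min_shared:
        return "partial"
    return None
```

A production system would more likely use an established stemming algorithm; the point of the sketch is only that each relationship is a progressively looser test applied to the same keyword and query.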

After the search results have been located, the search results from the pay
for performance database are combined with search results from the related search
database, block 208. At block 210, a search result list is returned to the searcher,
for example by displaying identified search listings on the search engine web page
and conveying the web page data over the network to the client computer. The
search results and related search results may be displayed in any convenient
fashion.

An example of a search result list display used in one embodiment of the
present invention is shown in FIG. 3, which is a display of the first several entries
resulting from the search for the term "CD burners." The exemplary display of
FIG. 3 shows a portion of a search result list including a plurality of entries 310a,
310b, 310c, 310d, 310e, 310f, 310g, 310h, 310i, a listing 312 of other search
categories and a related searches listing 314.

As shown in FIG. 3, a single entry, such as entry 310a in the search result
list consists of a description 320 of the web site, preferably comprising a title and a
short textual description, and a hyperlink 330 which, when clicked by a searcher,
directs the searcher's browser to the URL where the described web site is located.
The URL 340 may also be displayed in the search result list entry 310a, as shown
in FIG. 3. The "click through" of a search result item occurs when the remote
searcher viewing the search result item display 310 of FIG. 3 selects or clicks on
the hyperlink 330 of the search result item display 310.

Search result list entries 310a-310h may also show the rank value 360a,
360b, 360c, 360d, 360e, 360f, 360g, 360h, 360i of the advertisers' search listing.
The rank value 360a-360i is an ordinal value, preferably a number, generated and
assigned to the search listing by the processing system of the search engine web
server. Preferably, the rank value 360a-360i is assigned in a process, implemented
in software, that establishes an association between the bid amount, the rank, and
the search term of a search listing. The process gathers all search listings that
match a particular search term, sorts the search listings in order from highest to
lowest bid amount, and assigns a rank value to each search listing in order. The
highest bid amount receives the highest rank value, the next highest bid amount
receives the next highest rank value, proceeding to the lowest bid amount, which
receives the lowest rank value. The correlation between rank value and bid
amount is illustrated in FIG. 3, where each of the paid search list entries
310a-310h display the advertiser's bid amount 350a, 350b, 350c, 350d, 350e,
350f, 350g, 350h, 350i for that entry. If two search listings having the same
search term also have the same bid amount, the bid that was received earlier in
time will be assigned the higher rank value.
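
The rank assignment described above can be sketched as a sort on two keys. This is a minimal illustration under stated assumptions: the `Listing` fields, the integer `bid_cents` and `bid_time` representations, and the convention that rank 1 is the highest rank are all introduced here and are not taken from the Appendix.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    url: str
    search_term: str
    bid_cents: int   # bid amount in cents (hypothetical representation)
    bid_time: int    # hypothetical timestamp; smaller means received earlier


def rank_listings(listings, search_term):
    """Return (rank, listing) pairs for one search term.

    Listings are ordered from highest to lowest bid; equal bids are
    ordered by the time the bid was received, earlier first.
    """
    matching = [item for item in listings if item.search_term == search_term]
    matching.sort(key=lambda item: (-item.bid_cents, item.bid_time))
    return list(enumerate(matching, start=1))


listings = [
    Listing("http://a.example/", "cd burners", 50, bid_time=3),
    Listing("http://b.example/", "cd burners", 75, bid_time=5),
    Listing("http://c.example/", "cd burners", 50, bid_time=1),
    Listing("http://d.example/", "dvd drives", 90, bid_time=2),
]
ranked = rank_listings(listings, "cd burners")
```

Here the two 50-cent bids are separated by `bid_time`, so the listing for `c.example` is ranked above the one for `a.example`.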

The search result list of FIG. 3 does not include unpaid listings. In the
preferred embodiment, unpaid listings do not display a bid amount and are
displayed following the lowest ranked paid listing. Unpaid listings are generated
by a search engine utilizing object distributed database and text searching
algorithms as known in the art. An example of such a search engine is the search
engine operated by Inktomi Corporation. The original search query entered by the
remote searcher is used to generate unpaid listings through the conventional search
engine.

The listing 312 of other search categories provides other categories
for searching that may be related to the searcher's input search term 316. The
other search categories are selected for display by identifying a group such as
computer hardware containing the input search term 316. Categories within the
group are then displayed as hyperlinks which may be clicked through by the
searcher for additional searches. This enhances the user's convenience in cases
where the user's input search did not turn up suitable search results.

The related searches listing 314 displays six entries 318 of related searches
determined using the related searches database as described herein. In other
embodiments, other numbers of related search entries may be shown. In addition, a
link 320 labeled "more" allows the user to display additional related search
entries. In the illustrated embodiment, the displayed entries 318 are the top six
most relevant and most bidded-on terms in the related searches database.

Referring now to FIG. 4, the act of identifying matching related search
listings in a related search database (act 206, FIG. 2) in one embodiment
comprises the following acts. At block 400, an inverted index containing all data
from all web pages contained in the pay for performance database of the database
search system is searched. The inverted index is stored in the related searches
database. In an inverted index, a single index entry is used to reference many
database records. Searching for multiple matches per index entry is generally
faster when using inverted indexes, since each index entry may reference many
database records. The inverted index lists the words which can be searched in, for
example, alphabetical order and accompanying each word are pointers which
identify the particular documents which contain the word as well as the locations
within each document at which the word occurs. To perform a search, instead of
searching through the documents in word order, the computer locates the pointers
for the particular words identified in a search query and processes them. The
computer identifies the documents which have the required order and proximity
relationship for the search query terms.
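
A positional inverted index of the kind described above can be sketched in a few lines. This is an illustrative sketch only: the function names, the whitespace tokenization and the `max_gap` parameter are assumptions made here, and the sample documents are invented.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index


def find_in_order(index, first, second, max_gap):
    """Return doc ids where `second` follows `first` within `max_gap` words.

    Only the pointers stored for the two words are examined; the
    documents themselves are never scanned.
    """
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    hits = []
    for doc_id in sorted(first_postings.keys() & second_postings.keys()):
        if any(0 < q - p <= max_gap
               for p in first_postings[doc_id]
               for q in second_postings[doc_id]):
            hits.append(doc_id)
    return hits


documents = {
    1: "fast cd burners for sale",
    2: "burners and a cd rack",
    3: "cd and dvd burners",
}
index = build_inverted_index(documents)
```

With `max_gap=1` only document 1 satisfies the order and proximity requirement for "cd" followed by "burners"; widening the gap to 3 also admits document 3, while document 2 is excluded because the words occur in the opposite order.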

At block 402, meta-information is also searched for the received search
term. Meta-information is abstracted [...] information about the
collected data itself and forms a description of the data. Meta-information is
derived information and relational information. Meta-information for a listing
describes the relation of the listing to other listings, and meta-information for a
listing describes the relation of the advertisers sponsoring a listing to other
advertisers.

Meta-information is obtained using a script of commands to analyze the pay
for performance database and determine information and relationships present in
the data. The meta-information is collected for each row of data in the database
and attached to that row. In one embodiment, the script is run one time as a batch
process after the data is collected in the database. In other embodiments, the script
is periodically re-run to update the meta-information.

Meta-information about the web pages and key words contained in the pay
for performance database includes information such as the frequency of
occurrence of similar key words among different web site domains and the number
of different key words associated with a single web site. The meta-information
may further include fielded advertiser data which is the information contained in
each search listing provided by web site promoters who have bid upon search
terms in the pay for performance database: advertiser identification information;
web site themes, such as gambling or adult content; and derived themes.
Preferably, the meta-information is combined in a common inverted index with the
stored web page data searched at block 400.
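
Two of the meta-information values named above, the number of web site domains sharing a key word and the number of key words associated with one web site, can be derived in a single batch pass. The sketch below is an assumption-laden illustration: the row layout, the field names `keyword_domain_count` and `domain_keyword_count`, and the use of identical rather than merely similar key words are all choices made here, not details from the Appendix.

```python
from collections import defaultdict
from urllib.parse import urlparse


def attach_meta(rows):
    """Return copies of the rows with derived meta-information attached.

    Each input row is a dict with "keyword" and "url" entries
    (a hypothetical layout for a search listing row).
    """
    domains_per_keyword = defaultdict(set)
    keywords_per_domain = defaultdict(set)
    for row in rows:
        domain = urlparse(row["url"]).netloc
        domains_per_keyword[row["keyword"]].add(domain)
        keywords_per_domain[domain].add(row["keyword"])

    enriched = []
    for row in rows:
        domain = urlparse(row["url"]).netloc
        enriched.append({
            **row,
            # How many distinct domains carry this key word.
            "keyword_domain_count": len(domains_per_keyword[row["keyword"]]),
            # How many distinct key words this web site is listed under.
            "domain_keyword_count": len(keywords_per_domain[domain]),
        })
    return enriched


rows = [
    {"keyword": "cd burners", "url": "http://a.example/burners"},
    {"keyword": "cd burners", "url": "http://b.example/"},
    {"keyword": "dvd drives", "url": "http://a.example/dvd"},
]
enriched = attach_meta(rows)
```

Because the values are attached to each row, they can be indexed alongside the page text, which is consistent with combining the meta-information and the web page data in a common index.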

The result of the searches of block 400 and block 402 is a listing of rows of
the inverted index or indexes containing the searched information. Each row
contains the information associated with a search listing of the pay for
performance database along with all the text of the web page associated with the
search listing. In the illustrated embodiment, the search listing includes the
advertiser's search terms, the URL of the web page, a title and descriptive text.

At block 404, the returned related search results are sorted by relevancy.
Any suitable sorting routine may be used. A preferred process of sorting the
search results by relevancy, block 404, is illustrated in greater detail in FIG. 5.
At block 406, the six most relevant related search results are selected. It is
to be noted that any suitable number of search results may be provided. The
choice of providing six related searches as suggestions to a searcher is arbitrary.
After block 406, control proceeds at block 208, FIG. 2.
FIG. 5 is a flow diagram illustrating a method for sorting by relevancy
search results obtained from a related searches database, corresponding to
block 404 of FIG. 4. In the embodiment illustrated in FIG. 5, a relevancy value is
maintained for each returned listing. The relevancy value is adjusted according to
specific relevancy factors, some of which are defined in FIG. 5. Other relevancy
factors may be used as well. After adjusting the relevancy value, a final sorting
occurs and the highest valued listings are returned.

Initially, the relevancy values of the records returned by the
search (block 400, block 402, FIG. 4) are increased according to the frequency of
occurrence of a queried search term in each respective record. For example, if the
queried search term occurs frequently in the text associated with the search listing,
the relevancy of that listing is increased. If the queried search term occurs rarely
or not at all in the listing, the relevancy value of that listing is not increased or is
decreased.

At block 502, it is determined if there are multiple search terms in the
search queries submitted by the searcher. If not, control proceeds to block 506. If
there are multiple search terms, at block 504, the relevancy of individual search
results is increased according to proximity of the searched terms in a located
record. Thus, if two search terms are immediately proximate, the relevancy score
value for the record may be substantially increased, suggesting that the identified
search listing is highly relevant to the search query submitted by the searcher. On
the other hand, if the two search terms occur, for example, in the same sentence
but not in close proximity, the relevancy of the record may be slightly increased to
indicate the lesser relevancy suggested by the reduced proximity of the search
terms.
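
The frequency and proximity adjustments described above can be sketched in
Java as follows. This is only an illustration under stated assumptions: the
class name ProximityRelevance, the numeric weights, and the 20-word window
standing in for "the same sentence" are all hypothetical, since the
specification gives no numeric values.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the frequency and proximity adjustments (blocks 502-504). */
final class ProximityRelevance {

    // Illustrative weights only; the specification does not state numeric values.
    static final int PER_OCCURRENCE = 1;
    static final int ADJACENT_BONUS = 10;  // "substantially increased"
    static final int NEARBY_BONUS = 2;     // "slightly increased"
    static final int NEARBY_WINDOW = 20;   // assumed stand-in for "same sentence"

    /** Word offsets at which the term occurs in the tokenized text. */
    static List<Integer> positions(String[] words, String term) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals(term)) {
                out.add(i);
            }
        }
        return out;
    }

    /** Smallest distance between any occurrence of a and of b, or -1 if either is absent. */
    static int minDistance(List<Integer> a, List<Integer> b) {
        int best = -1;
        for (int x : a) {
            for (int y : b) {
                int d = Math.abs(x - y);
                if (best < 0 || d < best) {
                    best = d;
                }
            }
        }
        return best;
    }

    /** Relevancy from term frequency, plus a proximity bonus when several terms are queried. */
    static int score(String text, String[] queryTerms) {
        String[] words = text.toLowerCase().split("\\W+");
        int score = 0;
        List<List<Integer>> found = new ArrayList<>();
        for (String term : queryTerms) {
            List<Integer> p = positions(words, term.toLowerCase());
            score += PER_OCCURRENCE * p.size();
            found.add(p);
        }
        // Proximity is only considered for multi-term queries (block 502).
        for (int i = 0; i + 1 < found.size(); i++) {
            int d = minDistance(found.get(i), found.get(i + 1));
            if (d == 1) {
                score += ADJACENT_BONUS;
            } else if (d > 1 && d <= NEARBY_WINDOW) {
                score += NEARBY_BONUS;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        String[] query = {"digital", "camera"};
        System.out.println(score("Buy a digital camera today", query));
        System.out.println(score("A camera with many digital features", query));
    }
}
```

In this sketch a record whose terms are adjacent outranks one in which the
same terms are merely near each other, which mirrors the ordering the text
describes without claiming to reproduce the actual scoring.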

At block 506, it is determined if the located record contains a bidded search
term. Search terms are bidded on by advertisers, the bids being used for display of
search results by the search engine web server using the pay for performance
database. If the search result does include a bidded search term, the relevancy of
the record is adjusted, block 508. If the search result does not include one or more
bidded search terms, control proceeds to block 510.
At block 510, it is determined if there are search terms in the description of
the search listing. As illustrated in FIG. 3, each such listing includes a textual
description of the contents of the web site associated with the search listing. If the
search terms are not included in the description, control proceeds to block 514. If
the search terms are included in the description, at block 512, the relevancy of the
located record is adjusted accordingly.
At block 514, it is determined if the search terms are located in the title of
the search listing. As illustrated in FIG. 3, each search listing includes a title 360.
If the search terms are included in the title of a record, the relevancy of the record
is adjusted accordingly, block 516. If the search terms are not included in the title,
control proceeds to block 518.

At block 518, it is determined if the search terms are included in the
metatags of the search listing. Metatags are textual information included in a web
site which is not displayed for user use. However, the search listing contained in
the pay-for-performance database includes the metatags for searching and other
purposes. If, at block 518, the search terms are not included in the metatags,
control proceeds to block 522. On the other hand, if the search terms are included
in one or more metatags of the search listing, at block 520 the relevancy of the
record is adjusted accordingly.

At block 522, it is determined if the user's search terms are included in the
text of the bidded web page. If not, control proceeds to block 406, FIG. 4.
However, if the search terms are included in the web page text, at block 524 the
relevancy of the search listing record is adjusted accordingly.
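
The sequence of field checks in blocks 506 through 524 might be sketched as
below. The class FieldRelevance, its Listing type, the weights, and the rule
that a field "includes" the search terms when every term appears in it as a
case-insensitive substring are assumptions made for illustration; the
specification states only that the relevancy is "adjusted accordingly".

```java
/** Hypothetical sketch of the per-field relevancy adjustments of FIG. 5. */
final class FieldRelevance {

    // Illustrative weights only; not taken from the specification.
    static final int BIDDED_TERM = 8;
    static final int DESCRIPTION = 3;
    static final int TITLE = 5;
    static final int METATAGS = 2;
    static final int PAGE_TEXT = 1;

    /** Minimal stand-in for a search listing and its crawled page. */
    static final class Listing {
        final String bidTerm;
        final String title;
        final String description;
        final String metatags;
        final String pageText;

        Listing(String bidTerm, String title, String description,
                String metatags, String pageText) {
            this.bidTerm = bidTerm;
            this.title = title;
            this.description = description;
            this.metatags = metatags;
            this.pageText = pageText;
        }
    }

    /** Assumed matching rule: every term occurs in the field, ignoring case. */
    static boolean containsAll(String field, String[] terms) {
        String lower = field.toLowerCase();
        for (String term : terms) {
            if (!lower.contains(term.toLowerCase())) {
                return false;
            }
        }
        return true;
    }

    /** Applies each check in the order the flow diagram visits them. */
    static int adjust(int relevancy, Listing listing, String[] terms) {
        if (containsAll(listing.bidTerm, terms)) relevancy += BIDDED_TERM;      // blocks 506, 508
        if (containsAll(listing.description, terms)) relevancy += DESCRIPTION;  // blocks 510, 512
        if (containsAll(listing.title, terms)) relevancy += TITLE;              // blocks 514, 516
        if (containsAll(listing.metatags, terms)) relevancy += METATAGS;        // blocks 518, 520
        if (containsAll(listing.pageText, terms)) relevancy += PAGE_TEXT;       // blocks 522, 524
        return relevancy;
    }

    public static void main(String[] args) {
        Listing listing = new Listing("digital camera", "Digital Camera Shop",
                "Cameras and lenses", "photo", "We sell every digital camera");
        System.out.println(adjust(0, listing, new String[] {"digital", "camera"}));
    }
}
```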

Following the steps illustrated in FIG. 5, one or more and preferably six
most relevant related search listings are returned and presented to the searcher
along with the search results from the pay-for-performance database.
FIG. 6 illustrates a method for forming a related searches database for use
in the database search system of FIG. 1. The method begins at block 600.

At block 602, all text for all web pages in the pay-for-performance database
is fetched. This includes metatags and other non-displayed textual information
contained in the web page referenced by a URL contained in the pay for
performance database. At block 604, text from similar pages is omitted. This
reduces the amount of data which must be processed to form the related searches
database. One embodiment of a method for performing this act will be described
below in conjunction with FIG. 7. In addition, this greatly increases the speed at
which the related searches database may be produced. At block 606, the text is
stored in the related searches database.

At block 608, an inverted index is created, indexing the search listing data
stored at block 606 along with the text fetched at block 602. The resulting
inverted index includes a plurality of rows of data, each row including a key word
along with all text from the database associated with that key word.
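
A minimal sketch of building such an inverted index is shown below. It is an
assumption-laden illustration rather than the indexing performed by the
database product: the class InvertedIndexSketch is hypothetical, and each key
word is mapped to the numbers of the records containing it instead of to the
associated text itself.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch of an inverted index over stored listing text (block 608). */
final class InvertedIndexSketch {

    /** Maps each lower-cased key word to the sorted ids of the records that contain it. */
    static Map<String, Set<Integer>> build(List<String> records) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int id = 0; id < records.size(); id++) {
            for (String word : records.get(id).toLowerCase().split("\\W+")) {
                if (word.isEmpty()) {
                    continue; // split() yields an empty token when text starts with punctuation
                }
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index =
                build(List.of("cheap flights", "cheap hotels", "Flights to Rome"));
        System.out.println(index.get("cheap"));
        System.out.println(index.get("flights"));
    }
}
```

Looking up a key word then returns every record that mentions it without
scanning the stored text, which is the property the related searches query
relies on.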

One illustrative example of a configuration for the contents of the related
search database follows. Each row of the database includes the following
elements:

canon_cnt           integer        # Number of different search listings bidded
                                   # on this related result
advertiser_cnt      integer        # Number of different advertisers bidding on
                                   # this related result
related_result      varchar(50)    # related result (bidded search term),
                                   # canonicalized and depluralized
raw_search_text     varchar(50)    # original raw bidded search term
advertiser_ids      varchar(4096)  # explicit list of all advertisers bidding on
                                   # this related result
words               varchar(...)   # all text of all web pages crawled,
                                   # including hand coded descriptions
theme               varchar(50)
directory_taxonomy  varchar(200)


The count canon_cnt differs from the count advertiser_cnt because many
different web pages in the same domain could be bidded against the same bidded
search term, or many different advertisers may bid on only 1 search term. Special
themed keys are in the database with 'flags' inserted in the advertiser_cnt
field. If 'advertiser_cnt == 999999999', the query being presented
is an adult-oriented query. In this implementation, an optional enhancement is to
disable related results in this case. The counts canon_cnt and advertiser_cnt are
the current derived-data fields. Additional fields such as theme and
directory_taxonomy_category can optionally be added to give even more
enhanced relevance to related results matches, though they are not used in the
illustrated embodiment.
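
The flag convention described above can be illustrated with a short Java
sketch. The class RelatedRowSketch and its Row type are hypothetical and hold
only a few of the row elements; the value 999999999 and the choice to disable
related results when the advertiser-count field carries the flag come from the
text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of a related-search row and the adult-theme flag check. */
final class RelatedRowSketch {

    /** Magic value stored in the advertiser-count field of a themed row. */
    static final int ADULT_FLAG = 999999999;

    /** A reduced row: only the fields needed for this illustration. */
    static final class Row {
        final int canonCnt;
        final int advertiserCnt;
        final String relatedResult;

        Row(int canonCnt, int advertiserCnt, String relatedResult) {
            this.canonCnt = canonCnt;
            this.advertiserCnt = advertiserCnt;
            this.relatedResult = relatedResult;
        }
    }

    static boolean isAdultThemed(Row row) {
        return row.advertiserCnt == ADULT_FLAG;
    }

    /** Returns no related results at all when any returned row carries the flag. */
    static List<Row> relatedResultsFor(List<Row> rows) {
        for (Row row : rows) {
            if (isAdultThemed(row)) {
                return Collections.emptyList();
            }
        }
        return new ArrayList<>(rows);
    }

    public static void main(String[] args) {
        List<Row> plain = Arrays.asList(new Row(3, 2, "digital camera"));
        List<Row> themed = Arrays.asList(new Row(3, 2, "digital camera"),
                new Row(1, ADULT_FLAG, "flagged term"));
        System.out.println(relatedResultsFor(plain).size());
        System.out.println(relatedResultsFor(themed).size());
    }
}
```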

In one embodiment, the inverted index which is queried against to obtain
the related results is created with the following SQL command:

SQL> create metamorph inverted index mm_index02 on line_ad02(words);

This is the vendor-specific method (using the Texis relational database
management system provided by Thunderstone - EPI, Inc.) for creating a free-text
search index on a field (here, line_ad02(words)) which will be searched (from
RelatedSearcherCore.java) by the Texis Thunderstone SQL command:

    "SELECT "
    + "$rank, "              //Num  getRow() arg position 0
    + "canon_cnt, "          //Int  getRow() arg position 1
    + "raw_search_text, "    //Stri getRow() arg position 2
    + "cannon_search_text, " //Stri getRow() arg position 3
    + "advertiser_ids, "     //Stri getRow() arg position 4
    + "advertiser_cnt "      //Int  getRow() arg position 5
    + "FROM line_ad02 "
    + "WHERE words "
    + "LIKEP $query ORDER BY 1 desc, advertiser_cnt desc;";

The $rank is a vendor-supplied virtual data field which programmatically
contains the "relevancy" of the search result, based on the frequency of occurrence
of the queried phrase ($query) in the "words" field, the proximity of the queried
phrase elements to each other within the indexed words field, and the word order
(if > 1 query phrase word) as compared to the ordering of words within the
"words" field.

The "rank" is vendor-specific, and derived by various differing algorithms
by each vendor, but it is found in
practice that any vendor's Free Text Search Engine works to implement the
Related Searches Functionality.

The "ORDER BY 1 desc[ending], advertiser_cnt desc[ending]" controls
ranking the results of the query by relevance primarily (field "1" = $rank), and
secondarily by the derived field "advertiser_cnt", which is the count of advertisers
bidding on this particular related_search_result. Thus, "relevance" is the primary
selection criterion, and "popularity" is the secondary selection criterion.
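
The same two-level ordering can be reproduced in plain Java when rows have
already been fetched. The sketch below is illustrative only: the class
RelatedOrdering and its Row type are hypothetical, and the helper simply sorts
by rank descending, then by advertiser count descending, and keeps the first n
rows (six in the illustrated embodiment).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: relevance first, popularity second, then keep the top n. */
final class RelatedOrdering {

    static final class Row {
        final String term;
        final int rank;           // vendor-supplied relevancy value
        final int advertiserCnt;  // number of advertisers bidding on the term

        Row(String term, int rank, int advertiserCnt) {
            this.term = term;
            this.rank = rank;
            this.advertiserCnt = advertiserCnt;
        }
    }

    /** Equivalent in spirit to "ORDER BY rank desc, advertiser count desc" plus a row limit. */
    static List<Row> top(List<Row> rows, int n) {
        List<Row> sorted = new ArrayList<>(rows);
        Comparator<Row> byRankDesc = Comparator.comparingInt((Row r) -> r.rank).reversed();
        Comparator<Row> byPopularityDesc =
                Comparator.comparingInt((Row r) -> r.advertiserCnt).reversed();
        sorted.sort(byRankDesc.thenComparing(byPopularityDesc));
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("a", 900, 2), new Row("b", 950, 1), new Row("c", 900, 7));
        for (Row row : top(rows, 2)) {
            System.out.println(row.term);
        }
    }
}
```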

At block 610, additional indexes are created and stored with the inverted
index created at block 608. The additional indexes are created using key
information associated with each search listing. The key information includes, for
example, fielded advertiser data such as an advertiser's identification and derived
themes such as gambling and so forth. The method then ends at block 612.

FIG. 7 is a flow diagram illustrating a method for removing similar page
information from a database. The method in the illustrated implementation
follows performance of act 602 of FIG. 6.

At block 702, the pay for performance database (also referred to as a
bidded search listing database) is examined for URL data and all URLs are
extracted from the database and formed into a list. The list is sorted and any exact
duplicates are removed, block 704.

At block 706, a URL in the list is selected and it is determined if the
selected URL bears similarity to a preceding URL in the list. Similarity may be
determined by any suitable method, such as a number of identical characters or
fields within the URL or a percentage of identical characters, or a common root or
string or field.
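
One way the sorting, exact-duplicate removal, and neighbor comparison of
blocks 702 through 706 could look in Java is sketched below. The class
UrlCandidates, the shared-prefix measure, and the 0.8 threshold are
assumptions chosen for illustration; the text allows any suitable similarity
method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch of finding candidate duplicate URLs in a sorted list. */
final class UrlCandidates {

    // Assumed threshold: fraction of the longer URL covered by the shared prefix.
    static final double THRESHOLD = 0.8;

    /** Length of the common leading substring divided by the longer URL's length. */
    static double prefixSimilarity(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        int shared = 0;
        while (shared < limit && a.charAt(shared) == b.charAt(shared)) {
            shared++;
        }
        int longer = Math.max(a.length(), b.length());
        return longer == 0 ? 1.0 : (double) shared / longer;
    }

    /** Sorts, drops exact duplicates, then flags each URL that resembles its predecessor. */
    static List<String> candidateDuplicates(Collection<String> urls) {
        List<String> sorted = new ArrayList<>(new TreeSet<>(urls)); // blocks 702, 704
        List<String> candidates = new ArrayList<>();
        for (int i = 1; i < sorted.size(); i++) {                    // block 706
            if (prefixSimilarity(sorted.get(i - 1), sorted.get(i)) >= THRESHOLD) {
                candidates.add(sorted.get(i));                       // block 708
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://example.com/shop/item1",
                "http://example.com/shop/item2",
                "http://other.org/",
                "http://example.com/shop/item1");
        System.out.println(candidateDuplicates(urls));
    }
}
```

Because the list is sorted first, URLs sharing a root end up next to each
other, so comparing each entry only with its predecessor is enough to surface
likely duplicates in a single pass.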

At block 708, if the selected URL is similar to the preceding URL, the
selected URL is added to a list of candidate duplicate URLs. At block 710, a
predetermined number of the potentially duplicate URLs are crawled. In the
illustrated embodiment, the predetermined number is the first two potentially
duplicate URLs. Crawling is preferably accomplished using a program code
referred to as a crawler. A crawler is a program that visits Web sites and reads
their pages and other information. Such programs are also
known as "spiders" or "bots." Entire sites or specific pages can be selectively
visited and indexed by a crawler. In alternative embodiments, subsets of each site
referenced by a URL, rather than an entire site, may be crawled and compared for
similarity.

At block 712, the data returned by the crawler is examined. The data may
be referred to as the body of the URL and includes data from the site identified by
the URL and all accessible pages of the site. It is determined if the data including
text and other information contained in the body of the URL is sufficiently similar
to the data contained in the body of the previous URL. Again, similarity may be
determined by any suitable method such as a statistical comparison of the textual
content of each page. If there is sufficient similarity, control proceeds to block
714 and it is assumed that the URL is the same as the previous URL. The body of
text and other information is assigned to the rest of the similar URLs.
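
A statistical comparison of two crawled bodies could, for example, be a word
overlap measure. The sketch below uses Jaccard similarity over word sets; this
measure, the class BodySimilarity, and the 0.9 threshold are assumptions
swapped in for illustration, as the text does not name a particular
comparison.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of the body comparison of block 712 using word-set overlap. */
final class BodySimilarity {

    // Assumed cut-off for "sufficiently similar".
    static final double SUFFICIENT = 0.9;

    /** Distinct lower-cased words of a crawled body. */
    static Set<String> words(String body) {
        Set<String> out = new HashSet<>();
        for (String word : body.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(word);
            }
        }
        return out;
    }

    /** Jaccard similarity: shared words divided by all distinct words. */
    static double jaccard(String a, String b) {
        Set<String> wordsA = words(a);
        Set<String> wordsB = words(b);
        if (wordsA.isEmpty() && wordsB.isEmpty()) {
            return 1.0; // two empty bodies are treated as identical
        }
        Set<String> shared = new HashSet<>(wordsA);
        shared.retainAll(wordsB);
        Set<String> all = new HashSet<>(wordsA);
        all.addAll(wordsB);
        return (double) shared.size() / all.size();
    }

    static boolean sufficientlySimilar(String a, String b) {
        return jaccard(a, b) >= SUFFICIENT;
    }

    public static void main(String[] args) {
        System.out.println(jaccard("red blue green", "red blue green"));
        System.out.println(sufficientlySimilar("red blue", "blue green"));
    }
}
```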

If, at block 706, it was determined that the selected URL was not similar to 
the preceding URL, or if at block 712 it was determined that the body of the URL 
was not similar enough to the body of the previous URL, control proceeds to block 
718. At block 718, the URL is added to a list of URLs to be crawled. At block 
720, all URLs on the list are crawled to retrieve and store information contained at 
the sites indicated by each URL.

At block 716, the information from each crawled URL is loaded into the 
related searches database (also referred to as the free text database). The 
information is joined with search listing data already included in the related 
searches database. Thus, the method steps illustrated in FIG. 7 reduce the total
amount of data contained in the related searches database by reducing the number 
of URLs that are crawled and stored. Duplicate URLs are eliminated from the 
process and near-duplicate URLs are checked for similarity of content. The result 
is reduced storage requirements for the resulting database and faster, more 
efficient searching on the database. This enhances user convenience by improving 
performance. 

From the foregoing, it can be seen that the present invention provides an
improved method and apparatus for producing related searches for presentation to
a searcher searching in a pay for performance database. Related searches are
performed in a related searches database which has been formed using the pay for
performance database. The search results from the related searches database are
ordered by relevancy for presentation to the user. Thus, if a user's initial search
was too narrow or too broad, the user has available related searches which may be
used to produce more usable results. In addition, the related searches have been
produced using search listings referenced by bidded search terms. This provides a
benefit to advertisers who pay for advertising in the database search system. This
increases the likelihood that an advertiser's web site will be visited by a searcher
using the database system.

While a particular embodiment of the present invention has been shown
and described, modifications may be made. It is therefore intended in the
appended claims to cover all such changes and modifications which fall within the
true spirit and scope of the invention.

Source Code Appendix



This is the core Java piece that actually
performs the Free-text-search from the
Free Text Search engine (Texis RDBMS from
Thunderstone - EPI, Inc.), and post-processes
the results, filtering on advertiser frequency
of occurrence and theme ('adult'-ness).

package com.go2.search.related;

import java.util.Vector;
import java.util.Hashtable;
import java.util.StringTokenizer;
import java.rmi.RemoteException;

import com.go2.texis.*;

/**
 * @author Phil Rorex
 * @version
 */

class Callback implements TErrorMsgIF {
    private static int err115 = 0;

    public int getErr115() {
        return (err115);
    }

    public void ErrorMsgDelivery(String msg, int level, int msgNumber) {
        switch (msgNumber) {
        case 2: {
            System.out.println("FATAL: msg: " + msg + " level: " + level
                + " msgNumber: " + msgNumber);
            System.exit(2);
        }
        case 100: break;
        case 115: err115++; break;
        default:
            System.out.println("UNUSUAL: msg: " + msg + " level: "
                + level + " msgNumber: " + msgNumber);
        }
    }
}


/**
 * run as a stand-alone uVrd, since the Free Text Searcher
 * being used is best connected with as a JNI-based C language
 * library interface API
 **/
public class RelatedSearcherCore implements Runnable {
    //cache an instance of Texis server and Query
    private static Server texis = null;
    private static Query texisQuery = null;
    private static Query texisPlurQuery = null;
    private static Query texisAdultQuery = null;

    private static long timeClock;

    //used to coordinate time-outs on extra long queries
    private static final Integer PRE_QUERY = new Integer(1);
    private static final Integer MID_QUERY = new Integer(2);
    private static final Integer POST_QUERY = new Integer(3);

    private static Integer semaphore = PRE_QUERY;

    // time out process
    Thread watchDog;

    // Thread, starts out waiting on eternity
    // may never be used if Core() doesn't have timeout set for
    long globalTimeOut = 0;

    //The magic adult flag
    //If a related search free-text search returns a row
    //which has this field set, it's automatically "themed"
    //as an adult-oriented related search
    //This particular "Magic data row" is pre-loaded with
    //all the "adult-oriented" terms which are typical
    //in this theme. Same should be done for CASINO_FLAG,
    //CURRENT_NEWS_FLAG, and any other theme desired.
    private static final int ADULT_FLAG = 999999999;

    //How many pluralized tries of the query
    // to search for a singular and plural
    // version of up to [square root of] MAX_PLURAL_QRY
    // terms
    private static final int MAX_PLURAL_QRY = 4;

    // Limit Texis to this many rows
    //This is the initial # of pre-filtered free-text
    //searched rows coming back from the search engine
    private static final int MAX_ROWS = 50;

    //
    //controls the 'looseness' of the post-search filter
    //that filters out related searches based on the
    //derived data element of (#-of-different-advertisers-
    //bidding on this related search term). Set to 0 is
    //this element means "how many times we can ignore seeing
    //the identical advertiser before we start ignoring
    //related searches bid on by him" 0 is strongest reject,
    //larger numbers reject less stringently (usu. not > than
    //1, if ratio of webpages : related search terms is > than
    //about 10)
    private static final int ADVERTISER_THRESHOLD = 0;
    //the SQL query used to talk to Texis (the FTS engine)
    private static final String TEXIS_SQL =
        "SELECT "
        + "$rank, "              //Num  getRow() arg position 0
        + "canon_cnt, "          //Int  getRow() arg position 1
        + "raw_search_text, "    //Stri getRow() arg position 2
        + "cannon_search_text, " //Stri getRow() arg position 3
        + "advertiser_ids, "     //Stri getRow() arg position 4
        + "advertiser_cnt "      //Int  getRow() arg position 5
        + "FROM line_ad02 "
        + "WHERE words "
        + "LIKEP ? ORDER BY 1 desc, advertiser_cnt desc;";
    //  + "LIKEP ? ;";
    //  + "LIKEP ? ORDER BY advertiser_cnt desc;";

    private static final String TEXIS_PLUR_SQL =
        "SELECT plural "
        + "FROM plurals "
        + "WHERE singular "
        + "= ?;"; // final clause lost in the source; an equality match is assumed

    private static final String TEXIS_ADULT_SQL =
        "SELECT cannon_search_text "
        + "FROM adult "
        + "WHERE words "
        + "LIKE ?;";

    private static Callback cb = new Callback();

    public void init(String texisHome, long timeOut) {
        globalTimeOut = timeOut;
        init(texisHome);
    }

    public void init(String texisHome) {
        /**
         * Instantiate Texis connection object and perform
         * Texis query initialization
         * Called one time to setup the Related Search query.
         * Must be called before findRelated is ever called.
         */

        //Perform Texis initialization and precache an instance
        //of Texis Server and Texis Query
        try {
            Texis texisRDBMS = new Texis();
            texis = (Server) texisRDBMS.createServer(texisHome);

            Vector n = new Vector(200);
            // Vector n = texis.getNoise();

            n.addElement("a"); n.addElement("about"); n.addElement("after");
            n.addElement("again"); n.addElement("ago"); n.addElement("all");
            n.addElement("almost"); n.addElement("also"); n.addElement("always");
            n.addElement("am"); n.addElement("an"); n.addElement("and");
            n.addElement("another"); n.addElement("any"); n.addElement("anybody");
            n.addElement("anyhow"); n.addElement("anyone"); n.addElement("anything");
            n.addElement("anyway"); n.addElement("are"); n.addElement("as");
            n.addElement("at"); n.addElement("away"); n.addElement("be");
            n.addElement("became"); n.addElement("because"); n.addElement("been");
            n.addElement("before"); n.addElement("being"); n.addElement("but");
            n.addElement("by"); n.addElement("came"); n.addElement("can");
            n.addElement("cannot"); n.addElement("com"); n.addElement("come");
            n.addElement("could"); n.addElement("de"); n.addElement("del");
            n.addElement("der"); n.addElement("did"); n.addElement("do");
            n.addElement("does"); n.addElement("doing"); n.addElement("done");
            n.addElement("down"); n.addElement("each"); n.addElement("else");
            n.addElement("even"); n.addElement("ever"); n.addElement("every");
            n.addElement("everyone"); n.addElement("everything"); n.addElement("for");
            n.addElement("from"); n.addElement("front"); n.addElement("get");
            n.addElement("getting"); n.addElement("go"); n.addElement("goes");
            n.addElement("going"); n.addElement("gone"); n.addElement("got");
            n.addElement("gotten"); n.addElement("had"); n.addElement("has");
            n.addElement("have"); n.addElement("having"); n.addElement("he");
            n.addElement("her"); n.addElement("here"); n.addElement("him");
            n.addElement("his"); n.addElement("how"); n.addElement("i");
            n.addElement("if"); n.addElement("in"); n.addElement("into");

            n.addElement("is"); n.addElement("isn't"); n.addElement("it");

            n.addElement("jpg"); n.addElement("just"); n.addElement("last");
            n.addElement("least"); n.addElement("left"); n.addElement("less");
            n.addElement("let"); n.addElement("like"); n.addElement("make");
            n.addElement("many"); n.addElement("may"); n.addElement("maybe");
            n.addElement("me"); n.addElement("mine"); n.addElement("more");
            n.addElement("most"); n.addElement("much"); n.addElement("my");
            n.addElement("myself"); n.addElement("net"); n.addElement("never");
            n.addElement("no"); n.addElement("none"); n.addElement("not");
            n.addElement("now"); n.addElement("of"); n.addElement("off");
            n.addElement("on"); n.addElement("one"); n.addElement("onto");
            n.addElement("org"); n.addElement("our"); n.addElement("ourselves");
            n.addElement("out"); n.addElement("over"); n.addElement("per");
            n.addElement("put"); n.addElement("putting"); n.addElement("same");
            n.addElement("saw"); n.addElement("see"); n.addElement("seen");
            n.addElement("shall"); n.addElement("she"); n.addElement("should");
            n.addElement("so"); n.addElement("some"); n.addElement("somebody");
            n.addElement("someone"); n.addElement("something"); n.addElement("stand");
            n.addElement("such"); n.addElement("sure"); n.addElement("take");
            n.addElement("than"); n.addElement("that"); n.addElement("the");
            n.addElement("their"); n.addElement("them"); n.addElement("then");
            n.addElement("there"); n.addElement("these"); n.addElement("they");
            n.addElement("this"); n.addElement("those"); n.addElement("through");
            n.addElement("till"); n.addElement("to"); n.addElement("too");
            n.addElement("two"); n.addElement("unless"); n.addElement("until");
            n.addElement("up"); n.addElement("upon"); n.addElement("us");
            n.addElement("very"); n.addElement("was"); n.addElement("we");
            n.addElement("went"); n.addElement("were"); n.addElement("what");
            n.addElement("what's"); n.addElement("whatever"); n.addElement("when");
            n.addElement("where"); n.addElement("whether"); n.addElement("which");
            n.addElement("while"); n.addElement("who"); n.addElement("whoever");
            n.addElement("whom"); n.addElement("whose"); n.addElement("why");
            n.addElement("will"); n.addElement("with"); n.addElement("within");
            n.addElement("without"); n.addElement("won't"); n.addElement("would");
            n.addElement("wouldn't"); n.addElement("www"); n.addElement("yet");
            n.addElement("you"); n.addElement("your");

            texis.setNoise(n);
            texisQuery = (Query) texis.createQuery();
            texisPlurQuery = (Query) texis.createQuery();
            texisAdultQuery = (Query) texis.createQuery();

            /*
             * Query.api()'s affect ALL queries, not just ones set on
             */
            texisQuery.setlikeprows(MAX_ROWS);
            texisQuery.allinear(0);
            texisQuery.alpostproc(0);

            texisQuery.prepSQL(TEXIS_SQL);
            texisPlurQuery.prepSQL(TEXIS_PLUR_SQL);
            texisAdultQuery.prepSQL(TEXIS_ADULT_SQL);

            TErrorMsg.RegisterMsgDelivery(cb);

            watchDog = new Thread(this);
            watchDog.setPriority(Thread.NORM_PRIORITY + 1);
            watchDog.start();
        } catch (TException te) {
            te.printStackTrace();
            throw new RuntimeException(
                "Could not initialize Texis: Failed with: " + te.getMsg()
                + " code: " + te.getErrorCode()
            );
        } catch (RemoteException re) {
            throw new RuntimeException("Unexpected RemoteException " + re);
        }
    }

    /**
     * Perform a Texis related search query and package results (if
     * any) into an array of RelatedResults objects
     */
    public RelatedResult[] findRelated(String rawQuery, String
        canonQuery, int maxResults, int maxResultLength)
        throws Exception
    {
        try {
            return
                findRelated(rawQuery, canonQuery, maxResults, maxResultLength, 2000);
        } catch (Exception e) {
            e.printStackTrace();
            throw new Exception("overloaded findRelated" + e.getMessage());
        }
    }

    public RelatedResult[] findRelated(String rawQuery, String
        canonQuery, int maxResults, int maxResultLength, long timeOut)
        throws RelatedSearchException
    {
        //local vars
        Vector resultVector = new Vector();
        Vector thisRow = new Vector();
        RelatedResult[] results = null;
        int resultCount = 0;
        int rank = 0;
        int canonCnt = 0;
        Integer advertiser_id = null;
        int advertiserCnt = 0;
        String advertiserIds = null;
        String rawSearchText = null;
        String canonSearchText = null;

        Runtime rt = Runtime.getRuntime();
        long mem = rt.totalMemory();
        long free = rt.freeMemory();
        //System.err.println("totalMemory(): " + mem);
        //System.err.println("freeMemory(): " + free);

        try {
            if (canonQuery == null) return (null);
            Vector queryArgs = new Vector();
            String newQuery;

            if (timeOut != 0) {
                // get the loop out of wait() mode
                synchronized (watchDog) {
                    globalTimeOut = timeOut;
                    //System.err.println(
                    //    System.currentTimeMillis()
                    //    + " - "
                    //    + timeClock
                    //    + " = "
                    //    + (System.currentTimeMillis() - timeClock)
                    //    + " Core: setting MID_QUERY");
                    semaphore = MID_QUERY;
                    timeClock = System.currentTimeMillis();
                    // thread better be waiting on eternity
                    watchDog.notify();
                }
            }

            // if no RawQuery, probably have canonicalized version only
            if (rawQuery != null) {
                // usual serving site, don't re-pluralize, just use
                // raw query;
                newQuery = stripNoiseChars(rawQuery);
            } else {
                // only have a canonQuery to work with, so make
                // a rough approximation of a raw term to include in search
                // try and generate queries to cover up to MAX_PLURAL_QRY
                // possible re-pluralized forms of the query
                newQuery = pluralize(stripNoiseChars(canonQuery));
            } //if

            if (newQuery == null) return null;
            if (isAdult(newQuery)) return null;

            // Set up the (stack allocated) query parameters
            queryArgs.removeAllElements();
            queryArgs.addElement(newQuery);

            //perform JNI calls here
            texisQuery.setParam(queryArgs);
            texisQuery.execSQL();

            //Iterate over the rows
            String lastCanon = rawQuery;
            Hashtable advertisers = new Hashtable(MAX_ROWS*200);
            Hashtable used = new Hashtable(MAX_ROWS);
            Vector resultSet = texisQuery.getRows();
            //Vector resultSet = getRowsLocal();

            // make 2 passes,
            // first time de-dup on advertisers
            // second time don't dedup
            // System.out.println("got rows: " + resultSet.size());
            for (int pass = 0; pass < 2; pass++) {
                if (resultCount >= maxResults) break;
                for (int i = 0; i < resultSet.size(); i++) {
                    thisRow = (Vector) resultSet.elementAt(i);
                    if (thisRow.size() == 6) {
                        rank = ((Number)(thisRow.elementAt(0))).intValue();
                        //System.out.println(thisRow.elementAt(0).getClass().toString());
                        canonCnt = ((Integer)(thisRow.elementAt(1))).intValue();
                        rawSearchText = (String) thisRow.elementAt(2);
                        canonSearchText = (String) thisRow.elementAt(3);
                        advertiserIds = (String) thisRow.elementAt(4);
                        advertiserCnt = ((Integer) thisRow.elementAt(5)).intValue();

                        //Drop out early if we detect magic ADULT_FLAG
                        if (advertiserCnt == ADULT_FLAG) return null;
                        if (canonCnt == ADULT_FLAG) return null;

                        if (false) {
                            System.out.println("rank: "
                                + rank
                                + " cnt: "
                                + canonCnt
                                + " rst: "
                                + rawSearchText
                                + " cst: "
                                + canonSearchText
                                + " aids: "
                                + advertiserIds
                                + " adcnt: "
                                + advertiserCnt
                            );
                        }
                    } else {
                        throw new RelatedSearchException("Texis query failed, "
                            + "protocol violation");
                    }

                    // De-dup the results, and also don't return a related
                    // search term which canonically matches the original query
                    if ((!canonSearchText.equalsIgnoreCase(rawQuery)) &&
                        (!canonSearchText.equals(canonQuery)) &&
                        (!rawSearchText.equalsIgnoreCase(rawQuery)) &&
                        (!rawSearchText.equalsIgnoreCase(canonQuery)) &&
                        (canonSearchText.length() <= maxResultLength)) {
                        //System.out.println("got cst: " + canonSearchText);
                        //look for this advertiser in the hash table
                        //if there, increment occurrences count
                        // and if above threshold, we've seen enough
                        // terms suggested by this advertiser, so go to
                        // next term
                        // if not seen this advertiser yet, put it in the
                        // hash table and process
                        StringTokenizer st = new StringTokenizer(advertiserIds, " ");

//if (st.countTokens() != advertiserCnt) {
//    System.out.println("toks: "
//        + st.countTokens()
//        + " cnt: "
//        + advertiserCnt);
//    throw new RelatedSearchException(
//        "Texis query suspect, wrong advertiser count");
//}



int dupAdvCnt = 0;
boolean Next = false;

// Parse all the advertiser ID's out of the returned row
while (st.hasMoreTokens()) {
    Integer advertiserId = Integer.valueOf(st.nextToken());

    //if (!advertisers.containsKey(advertiserId))

    // if this advertiser is new to us (over whole query)
    // put in the hash
    Integer cnt = (Integer) advertisers.get(advertiserId);
    if (cnt == null)
    {
        advertisers.put(advertiserId, new Integer(0));
    } else {
        // Seen this advertiser before, so increment his
        // tally
        advertisers.put(advertiserId, new Integer(cnt.intValue() + 1));
        //System.out.println(advertiserId + " dups: " +
        //    (cnt.intValue() + 1));

        // If he's (now) past the threshold, don't use
        // bidded term (yet)
        if (cnt.intValue() >= ADVERTISER_THRESHOLD) {
            Next = true;
            break;
        }
        dupAdvCnt++;
    }
}

if (Next == true) {
    continue;
} else {
    if (!used.containsKey(canonSearchText)) {
        used.put(canonSearchText, new Boolean(true));
    }
}
} else {
    if (!used.containsKey(canonSearchText)) {
        used.put(canonSearchText, new Boolean(true));
    } else {
        continue;
    }
}



//if (dupAdvCnt >= ADVERTISER_THRESHOLD) {
//continue;

System.out.println("dups: " + dupAdvCnt);
if (pass == 0) {
    // first time thru see if we've used this advertiser
    Integer cnt = (Integer) advertisers.get(advertiser_id);
    if (cnt == null) {
        advertisers.put(advertiser_id, new Integer(0));
    } else {
        advertisers.put(advertiser_id, new Integer(cnt.intValue() + 1));
        if (cnt.intValue() >= ADVERTISER_THRESHOLD) {
            continue;
        }
    }
    if (!used.containsKey(canonSearchText)) {
        used.put(canonSearchText, new Boolean(true));
    }
} else {
    // this is a second (or more) time thru.
    // see if we've already used this term
    if (!used.containsKey(canonSearchText)) {
        used.put(canonSearchText, new Boolean(true));
    } else {
        continue;
    }
}

if (resultCount < maxResults) {
    resultVector.addElement(new RelatedResult(rawSearchText,
        RelatedResult.NON_CACHED));
    resultCount++;
} else {
    break;
} //if-else
}//if
}//for

v / / -f=^-K> 

ill J-WJ. 

} catch (TException te) {
    throw new RelatedSearchException("Texis interface failed with: "
        + te.getMsg(), te);
} catch (Throwable t) {
    t.printStackTrace();
    throw new RelatedSearchException("Unexpected Texis failure with "
        + t.getMessage(), t);
} finally {
    if (timeOut != 0) {
        synchronized (watchDog) {
            // System.err.println(
            //     System.currentTimeMillis()
            //     + "-"
            //     + timeClock
            //     + "="
            //     + (System.currentTimeMillis() - timeClock)
            //     + " Core: setting POST_QUERY");
            semaphore = POST_QUERY;
            timeClock = System.currentTimeMillis();
            // cause thread to wait on eternity
            globalTimeOut = 0;
            watchDog.notify();
            // System.err.println(
            //     System.currentTimeMillis()
            //     + "-"
            //     + timeClock
            //     + "="
            //     + (System.currentTimeMillis() - timeClock)
            //     + " Core: done with calling notify");
        }
    }
}

//System.out.println("INFO: 115 err's: " + cb.getErrllS());
if (cb.getErrllS() > 100) {
    System.out.println("FATAL: Too many ErrllS's");
    System.exit(3);
}

if (resultVector.size() == 0)
    return null;
else {
    resultVector.copyInto(results = new
        RelatedResult[resultVector.size()]);
    return results;
}
}

private String stripNoiseChars(String term) {
    // Clean up the query a bit
    if (term.length() < 2) return (null);

    char[] buf = new char[term.length()];
    int firstChar = 0;

    term.getChars(0, buf.length, buf, 0);

    for (int i = 0; i < buf.length; i++) {
        // reject anything outside printable ASCII
        if (buf[i] < 0x20 || buf[i] > 0x7e) return (null);

        switch (buf[i]) {
case 



case 
case 
case 
case 
case 
case 
case 
case 
case 
case 
case 
case 
case 
case 
case 
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case ('{'}: 
case ('['):
case ('}'):
case (']'):
case ('!'):
case ('\''):
case (':'):
case (';'):
case ('"'):
case ('\\'):
case ('>'):
case (','):
case ('<'):
case ('.'):
case ('/'):
case ('?'):
case ( 1 1 ) : { 
buf[i] = ' ';
//System.out.println("i:"
//+ i + "firstChar:" + firstChar
//+ "setting buf[i]: "
//+ (String.valueOf(buf[i])) + " setting to space");
if (firstChar == i) firstChar = i + 1;
}
}
}

// only spaces left
if (firstChar == buf.length) return (null);

term = term == null ? null :
    String.valueOf(buf, firstChar, buf.length - firstChar).trim();

switch (term.length()) {
case 0:
case 1: return (null);
default: {
    switch (buf[firstChar]) {
    case ('h'):
    case ('H'):
    case ('w'):
    case ('W'): {
        String lowerTerm = term.toLowerCase();
        // use the lcase vers of the string for testing
        // but make sure to SET the original string to return

if ( 1 ower Term . s t arts With ( " h 1 1 p www " ) } term = 
term . subs t ring (10) ; 

else if (lowerTerm.startsWithf "http www")) term = 
term . substring ( 3 ) ; 

else if (1 ower Te rm. start s wi t h ( " h t tp www " ) ) te rm = 
term . substring ( 8) ; 

else if (lowerTernwstartsWith( "hhttp www")) term = 
term . substring ( 11 } ; 

else if (lowerTerm. start sWith ( "http « ) ) term = 
term . subs tring (5) ; 

else if (lowerTerm. &tartsWith( "http" ) ) term = 
term. substring (4) ; 
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else if { lowerTerm. startsWith ( "www " ) ) term = 
term . substring ( 4 ) ; 

else if { 1 owe r Term -start s W i t h ( " www " ) ) term - 
term . substring ( 3 ) ; 

} 

) 

} 

} 

switch (term.length()) {
case 0:
case 1: return (null);
default: {
    switch (term.charAt(term.length() - 1)) {
    case ('m'):
    case ('M'):
    case ('t'):
    case ('T'):
    case ('g'):
    case ('G'):
    case ('f'):
    case ('F'): {
        String lowerTerm = term.toLowerCase();

        if (lowerTerm.endsWith(" dot com")) term = term.length() > 8
            ? term.substring(0, term.length() - 8) : null;
        else if (lowerTerm.endsWith(" dotcom")) term = term.length() > 7
            ? term.substring(0, term.length() - 7) : null;
        else if (lowerTerm.endsWith(" com")) term = term.length() > 4
            ? term.substring(0, term.length() - 4) : null;
        else if (lowerTerm.endsWith(" net")) term = term.length() > 4
            ? term.substring(0, term.length() - 4) : null;
        else if (lowerTerm.endsWith(" org")) term = term.length() > 4
            ? term.substring(0, term.length() - 4) : null;
        else if (lowerTerm.endsWith(" gif")) term = term.length() > 4
            ? term.substring(0, term.length() - 4) : null;
        else if (lowerTerm.endsWith(" jpg")) term = term.length() > 4
            ? term.substring(0, term.length() - 4) : null;
    }
    }
}
}

//Debug: System.out.println("term: [" + term.trim() + "]");
return (term == null ? null : term.trim());
}




private boolean isAdult(String query)
    throws RelatedSearchException
{
    if (query == null) return false;
    Vector queryArgs = new Vector();
    Vector thisRow = new Vector();
    queryArgs.addElement(query);

    try {
        //perform JNI calls
        texisAdultQuery.setParam(queryArgs);
        texisAdultQuery.execSQL();

        // any returned row means the query is flagged as adult
        if ((thisRow = texisAdultQuery.getRow()).size() != 0) {
            return (true);
        } else {
            return (false);
        }
    } catch (TException te) {
        throw new RelatedSearchException("Texis interface failed with: " +
            te.getMsg(), te);
    } catch (RemoteException re) {
        throw new RelatedSearchException("Got a RemoteException that " +
            "should never occur: " + re);
    }
}

private String pluralize(String token)
    throws RelatedSearchException
{
    if (token == null) return null;
    Vector queryArgs = new Vector();
    String pluralToken;
    Vector thisRow = new Vector();

    StringTokenizer st0 = new StringTokenizer(token, " ");
    String[] terms = new String[st0.countTokens()];
    String[] fullQuery = new String[MAX_PLURAL_QRY];
    int fullQueryCnt = 0;

    // Iterate over each token to see if there's a plural version
    for (int ele0 = 0; st0.hasMoreTokens(); ele0++) {
        terms[ele0] = st0.nextToken();
    }

    for (int element = 0; element < terms.length && fullQueryCnt <
            MAX_PLURAL_QRY; element++) {
        // Do plurals lookup on this term from texis db
        queryArgs.removeAllElements();
        queryArgs.addElement(terms[element]);
        try {
            //perform JNI calls
            texisPlurQuery.setParam(queryArgs);
            texisPlurQuery.execSQL();
            //retrieve the row
            if ((thisRow = texisPlurQuery.getRow()).size() != 0) {
                String term = null;
                // loop thru the terms, substituting the plural form
                // for the term at position 'element'
                for (int ele1 = 0; ele1 < terms.length; ele1++) {
                    if (ele1 == element) {
                        if (ele1 == 0) {
                            term = (String) (thisRow.elementAt(0));
                        } else {
                            term += " " + (String) (thisRow.elementAt(0));
                        }
                    } else {
                        if (ele1 == 0)
                            term = terms[ele1];
                        else
                            term += " " + terms[ele1];
                    }
                }
                fullQuery[fullQueryCnt] = term;
                fullQueryCnt++;
            }
        } catch (TException te) {
            throw new RelatedSearchException("Texis interface failed with: "
                + te.getMsg(), te);
        } catch (RemoteException re) {
            throw new RelatedSearchException("Got a " +
                "RemoteException that should never occur: " + re);
        }
    }

    // Build the new expanded query
    if (fullQueryCnt > 0) {
        pluralToken = "(" + token;
        for (int i = 0; i < fullQueryCnt; i++) {
            pluralToken = pluralToken + "," + fullQuery[i];
        }
        pluralToken = pluralToken + ")";
    } else {
        pluralToken = token;
    }

    return (pluralToken);
}

public Vector getRowsLocal() throws TException, RemoteException {
    Vector set = new Vector();
    int e;

    synchronized (APITokensLock) {
        while (true) {
            Vector row = new Vector();

•v- rtT»T — +- -i n ^ i ^ -v— * r ^N-/~v 4-D /m.t / \ • 

luvv — ucaaduucx v . y t; UiVUW \ I f 

            if (row.size() == 0)
                break;
            set.addElement(row);
        }
    }

    return set;
}

public synchronized void run() {
    while (true) {
        try {
            // start our timeout
            synchronized (watchDog) {
                // System.err.println(
                //     System.currentTimeMillis()
                //     + "-"
                //     + timeClock
                //     + "="
                //     + (System.currentTimeMillis() - timeClock)
                //     + " run: starting wait of "
                //     + globalTimeOut);
                watchDog.wait(globalTimeOut);
                // just got woke up,
                // see why
                if (semaphore.equals(PRE_QUERY)) {
                    // System.err.println(
                    //     System.currentTimeMillis()
                    //     + "-"
                    //     + timeClock
                    //     + "="
                    //     + (System.currentTimeMillis() - timeClock)
                    //     + " run: got PRE_QUERY");
                    continue;
                } else if (semaphore.equals(POST_QUERY)) {
                    // System.err.println(
                    //     System.currentTimeMillis()
                    //     + "-"
                    //     + timeClock
                    //     + "="
                    //     + (System.currentTimeMillis() - timeClock)
                    //     + " run: got POST_QUERY");
                    continue;
                } else if (semaphore.equals(MID_QUERY)) {
                    if (System.currentTimeMillis() - timeClock >= globalTimeOut) {
                        // we timed out, but semaphore wasn't
                        // set, so hose ourselves
                        System.err.println(
                            System.currentTimeMillis()
                            + "-"
                            + timeClock
                            + "="
                            + (System.currentTimeMillis() - timeClock)
                            + " Fatal: timeout "
                            + globalTimeOut
                            + " usec exceeded");
                        System.exit(1);
                    } else {
                        // System.err.println(
                        //     System.currentTimeMillis()
                        //     + "-"
                        //     + timeClock
                        //     + "="
                        //     + (System.currentTimeMillis() - timeClock)
                        //     + " run: got MID_QUERY, but OK!");
                    }
                } else {
                    System.err.println(
                        System.currentTimeMillis()
                        + "-"
                        + timeClock
                        + "="
                        + (System.currentTimeMillis() - timeClock)
                        + " run: ARGHH got no_QUERY, Hmmmmm!");
                }
            }
        } catch (Exception e) {
            System.err.println("got wait() exception");
        }
    }
}
}
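The filtering that the listing above applies to candidate rows (skip terms that match the original query, limit how many terms a single advertiser may contribute, and never return the same canonical term twice) can be summarized in a short, self-contained sketch. This is an illustration only, not the appendix code: the class name RelatedTermFilter, the Candidate type and the method signature are hypothetical, and modern collections stand in for the Hashtable and Vector used in the listing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the appendix code.
final class RelatedTermFilter {

    /** Hypothetical candidate row: a canonical term plus the advertisers bidding on it. */
    static final class Candidate {
        final String canonTerm;
        final List<Integer> advertiserIds;

        Candidate(String canonTerm, List<Integer> advertiserIds) {
            this.canonTerm = canonTerm;
            this.advertiserIds = advertiserIds;
        }
    }

    static List<String> filter(List<Candidate> rows, String canonQuery,
                               int threshold, int maxResults) {
        Map<Integer, Integer> seen = new HashMap<>(); // advertiser id -> repeat count
        Set<String> used = new HashSet<>();           // canonical terms already returned
        List<String> out = new ArrayList<>();

        for (Candidate row : rows) {
            if (out.size() >= maxResults) break;
            // never suggest the query itself
            if (row.canonTerm.equalsIgnoreCase(canonQuery)) continue;

            boolean overThreshold = false;
            for (Integer id : row.advertiserIds) {
                Integer cnt = seen.get(id);
                if (cnt == null) {
                    seen.put(id, 0);              // first sighting of this advertiser
                } else {
                    seen.put(id, cnt + 1);        // seen before, bump the tally
                    if (cnt >= threshold) {       // enough terms from this advertiser
                        overThreshold = true;
                        break;
                    }
                }
            }
            if (overThreshold) continue;
            // add() returns false when the term was already used
            if (used.add(row.canonTerm)) out.add(row.canonTerm);
        }
        return out;
    }
}
```

As in the listing, the first sighting of an advertiser stores a count of zero, so with a threshold of 1 an advertiser can still contribute two terms before further terms are skipped.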






WO 01/90947 






PCT/US01/16161






The following code is used to implement the related searcher. It first
tries to see if we've seen this related search before, to
save time and not do the algorithmic lookup
during the related search execution.
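The cache-first idea described above can be illustrated with a minimal sketch. It assumes an in-memory map in place of the Oracle cache table and a caller-supplied function in place of the Texis lookup; the names CacheFirstLookup, remember and slowLookup are hypothetical and do not appear in the listing. An empty cached entry is treated as "known to have no related results", a simplification of how the listing bails out early when the cached row reports zero terms.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of a cache-first lookup with fallback.
final class CacheFirstLookup {
    private final Map<String, List<String>> cache = new HashMap<>();
    private final Function<String, List<String>> slowLookup; // stands in for the algorithmic lookup
    private int cacheHits = 0;
    private int slowRequests = 0;

    CacheFirstLookup(Function<String, List<String>> slowLookup) {
        this.slowLookup = slowLookup;
    }

    /** Record a previously computed answer for a canonical query. */
    void remember(String canonQuery, List<String> related) {
        cache.put(canonQuery, related);
    }

    /** Returns related terms, or null when the cache says there are none. */
    List<String> findRelated(String canonQuery) {
        List<String> cached = cache.get(canonQuery);
        if (cached != null) {
            cacheHits++;
            // an empty entry means "seen before, nothing to suggest"
            return cached.isEmpty() ? null : cached;
        }
        // not seen before: fall back to the slower lookup
        slowRequests++;
        return slowLookup.apply(canonQuery);
    }

    int getCacheHits() { return cacheHits; }

    int getSlowRequests() { return slowRequests; }
}
```

The two counters mirror the kind of statistics the listing keeps (cache hits and fallback requests) so that the effectiveness of the cache can be observed at run time.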



package com.go2.search.related;

//import atg
import atg.nucleus.GenericService;
import atg.nucleus.ServiceException;

import atg.service.resourcepool.JDBCConnectionPool;
import atg.service.resourcepool.ResourceObject;
import atg.service.resourcepool.ResourcePoolException;

import java.rmi.RemoteException;
import java.net.*;
import java.io.*;
import java.sql.*;
import java.util.Vector;

/**
 * This is the top level interface to the related search
 * system; it is meant to be used as a dynamo service available
 * to other dynamo services
 */
//public class RelatedSearcherImpl extends GenericRMIService
public class RelatedSearcherImpl extends GenericService
    implements RelatedSearcher
{

    //my pool of Texis/UDP
    private TexisUDPConnectionPool texisUDPConnectionPool;

    //my pool of connections to Oracle cache
    private JDBCConnectionPool relatedCacheConnectionPool;

    //statistics properties
    private int requestCount = 0;
    private int oracleCacheHits = 0;
    private int texisRequests = 0;
    private int texisTimeoutMillis = 0;
    private int slowTexisRequestCount = 0;

    //private constants
    private static String CACHE_SQL = "SELECT * FROM REL_SEARCH " +
        "WHERE CANON_QUERY=?";
    private static int BUFFER_SIZE = 512;

    //parameters
    private boolean texisEnabled = false;
    private boolean oracleEnabled = false;
    private boolean systemEnabled = false;
    private long cummulativeOracleTime = 0;
    private long cummulativeTexisTime = 0;

    /**
     * Create and export an instance of RelatedSearcher over RMI
     */
    public RelatedSearcherImpl() throws RemoteException {
        super();
        //java.rmi.registry.LocateRegistry.createRegistry(1111).rebind(
        //    "RelatedSearcher", this);
    }

    /**
     * This method was created in VisualAge.
     * @return RelatedResult[]
     * @param canonQuery java.lang.String
     * @param maxResults int
     * @param maxLength int
     */
    private RelatedResult[] findFromCache(String canonQuery, int
            maxResults, int maxLength)
        throws RelatedSearchException
    {
        Vector resultVector = new Vector();
        RelatedResult[] results = null;
        PreparedStatement ps = null;
        ResultSet rs = null;
        try {
            //Get a Connection
            ResourceObject resource = null;
            try {
                resource = getRelatedCacheConnectionPool().checkOut
                    (getAbsoluteName());
                Connection conn = (Connection) resource.getResource();
                boolean success = false;
                try {
                    //Here's where we get the goods from Oracle
                    ps = conn.prepareStatement(CACHE_SQL);
                    ps.setString(1, canonQuery);
                    rs = ps.executeQuery();
                    //prime the cursor to point to the one and only row we
                    //expect from Oracle; if no matching rows were found
                    //then we'll simply drop thru to the end
                    if (rs.next()) {
                        //Extract the data we need if there was something
                        int numTerms = rs.getInt(2);
                        if (numTerms == 0)
                            //The cache tells us that there won't be
                            //any results so we'll bail early
                            throw new
                                RelatedSearchException("No related Results");
                        int cacheFlag = rs.getInt(3);
                        //iterate over results retrieving up to maxResults
                        //of those of them that are maxLength or smaller
                        int resultCount = 0, rowCount = 0;
                        while (resultCount < maxResults &&
                                rowCount < numTerms) {
                            String term = rs.getString(4 + rowCount);
                            //push this term into the result vector if its good
                            if (term.length() <= maxLength) {
                                resultVector.addElement(new
                                    RelatedResult(term, cacheFlag));
                                resultCount++;
                            }
                            rowCount++;
                        }
                    }
                    conn.commit();
                    success = true;
                } finally {
                    //Cleanup result set
                    if (rs != null)
                        rs.close();
                    //Cleanup prepared statement
                    if (ps != null)
                        ps.close();
                    //Cleanup connection
                    if (!success && conn != null) conn.rollback();
                }//try-finally
            } finally {
                // Check the Connection back in
                if (resource != null)
                    getRelatedCacheConnectionPool().checkIn(resource);
            }//try-finally
        } catch (ResourcePoolException exc) {
            if (isLoggingError()) {
                logError("unable to get Oracle cache connection", exc);
            }
            throw new RelatedSearchException("Unable to get Oracle " +
                "cache connection", exc);
        } catch (SQLException se) {
            if (isLoggingError()) {
                logError("Interface with Oracle cache failed", se);
            }
            throw new RelatedSearchException("Interface with Oracle " +
                "cache failed", se);
        } //try-catch

        if (resultVector.size() == 0)
            return null;
        else {
            resultVector.copyInto(results = new
                RelatedResult[resultVector.size()]);
            return results;
        }
    }

    /**
     * Communicate to Texis thru TexisConnectionPool
     * @return RelatedResult[]
     * @param canonQuery java.lang.String
     * @param maxResults int
     * @param maxLength int
     */
    private RelatedResult[] findFromUDPTexis(String rawQuery, String
            canonQuery, int maxResults, int maxLength)
        throws RelatedSearchException
    {
        RelatedResult[] results = null;
        //Get a UDPTexisConnection
        ResourceObject resource = null;
        TexisUDPConnection tc = null;

        try {
            resource = getTexisUDPConnectionPool().checkOut
                (getAbsoluteName());
            tc = (TexisUDPConnection) resource.getResource();
            DatagramSocket socket = tc.getSocket();
            //do this at run time to be able to switch Dynamo at run time
            socket.setSoTimeout(getTexisTimeoutMillis());
            //package data to send
            TexisRequest request = new TexisRequest();
            request.setRawQuery(rawQuery);
            request.setCanonQuery(canonQuery);
            request.setMaxResults(maxResults);
            request.setMaxChars(maxLength);
            request.setSequenceNumber(++tc.sequenceNumber);
            request.setTimeout(getTexisTimeoutMillis());

            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            ObjectOutputStream ous = new ObjectOutputStream(baos);
            ous.writeObject(request);
            ous.flush();
            baos.close();
            byte[] sendData = baos.toByteArray();
            //send it off to the server
            if (isLoggingDebug()) {
                logDebug("About to send to Texis at endpoint: " +
                    tc.getHost() + ":" + tc.getPort());
            }
            //send it
            DatagramPacket sendPacket = new DatagramPacket(sendData,
                sendData.length, tc.getHost(), tc.getPort());
            socket.send(sendPacket);

            //wait for a reply up to timeOut milliseconds
            long startWait = System.currentTimeMillis();
            while (true) { //pull off inbound packets and check them
                           //for the right sequenceNumber
                //throws a java.io.InterruptedIOException on timeout
                DatagramPacket receivePacket = new
                    DatagramPacket(new byte[BUFFER_SIZE], BUFFER_SIZE);
                socket.receive(receivePacket);
                ObjectInputStream ois = new ObjectInputStream(new
                    ByteArrayInputStream(receivePacket.getData()));
                TexisResponse response =
                    (TexisResponse) ois.readObject();
                ois.close();
                if (response.getSequenceNumber() !=
                        tc.sequenceNumber) {
                    //we got a stale response
                    long midpoint = System.currentTimeMillis();
                    int remainder = (int) (getTexisTimeoutMillis()
                        - (midpoint - startWait));
                    if (remainder > 0) { //if we can still wait
                                         //some more before a timeout
                        //reset socket timeOut to the remaining time
                        socket.setSoTimeout(remainder);
                    } else {
                        //give up at this point
                        break;
                    }
                } //if-wrong-sequence-number
                else {
                    results = response.getResults();
                    break;
                }
            } //while
        } catch (ResourcePoolException rpe) {
            if (isLoggingError()) {
                logError("Unable to get or check in a Texis connection", rpe);
            }
        } catch (ClassNotFoundException cnfe) {
            if (isLoggingError()) {
                logError("Class not found Exception", cnfe);
            }
        } catch (SocketException se) {
            if (isLoggingError()) {
                logError("Socket Exception talking to Texis", se);
            }
        } catch (StreamCorruptedException sce) {
            if (isLoggingError()) {
                logError("Corrupted return from Texis", sce);
            }
        } catch (InterruptedIOException ioie) {
            if (isLoggingDebug()) {
                logDebug("Timed out talking to Texis", ioie);
            }
        } catch (IOException ioe) {
            if (isLoggingDebug()) {
                logDebug("Timed out talking to Texis", ioe);
            }
        }
        finally {
            // Check the Connection back in if we got it in the first place
            try {
                if (resource != null)
                    getTexisUDPConnectionPool().checkIn(resource);
            } catch (ResourcePoolException rpe) { /* ignore this one */
            }
        }
        return results;
    }

    /**
     * @return RelatedResult[] - an array of RelatedResult objects
     * which is ordered by relevance from high to low, or null if the
     * system is disabled or no related results were found
     * @param rawQuery java.lang.String - raw query for which related
     * searches are needed
     * @param canonQuery java.lang.String - canonicalized form of the raw
     * query
     * @param maxResults int - maximum number of results requested
     * @param maxResultLength int - maximum length of a result in
     * characters
     */
    public RelatedResult[] findRelated(String rawQuery, String
            canonQuery, int maxResults, int maxResultLength)
        //throws RelatedSearchException
        throws RelatedSearchException, RemoteException
    {
        requestCount++;
        //Return fast if system is disabled
        if (!getSystemEnabled())
            return null;

        RelatedResult[] results = null;

        //first try getting data from the Oracle pool (if enabled)
        if (getOracleEnabled()) {
            try {
                long startOracle = System.currentTimeMillis();
                //keep timing stats
                results = findFromCache(canonQuery, maxResults,
                    maxResultLength);
                oracleCacheHits++; //fixed statistics bug
                cummulativeOracleTime +=
                    (System.currentTimeMillis() - startOracle);
            } catch (RelatedSearchException rse) {
                //If Oracle told us that this search has no related
                //i.e. editorially-excluded porn, then drop out early
                if (rse.getRootCause() == null) {
                    return null;
                } else { //log it otherwise for post mortem
                    if (isLoggingError())
                        logError("Failed to interface to Oracle " +
                            "cache; will try Texis", rse);
                }
            } catch (Exception e) {
                if (isLoggingError())
                    logError("Failed to interface to Oracle", e);
            }
        }//if Oracle enabled

        //if unsuccessful then try Texis pool if enabled
        if (getTexisEnabled() && results == null) {
            try {
                long startTexisQuery = System.currentTimeMillis();
                //keep texis timing stats
                texisRequests++;
                results = findFromUDPTexis(rawQuery, canonQuery,
                    maxResults, maxResultLength);
                long texisQueryMillis = System.currentTimeMillis()
                    - startTexisQuery;
                //log abnormally long request time
                if (texisQueryMillis > getTexisTimeoutMillis())
                    slowTexisRequestCount++;
                cummulativeTexisTime += texisQueryMillis;
            } catch (Exception e) {
                if (isLoggingError())
                    logError("Texis interface failed with: " +
                        e.getMessage(), e);
            }
        }//if texis Enabled

        return results;
    }

    /**
     * Stats accessor
     * @return String
     */
    public String getCummulativeOracleTime() {
        return (cummulativeOracleTime / 1000.0) + " seconds";
    }

    /**
     * Stats accessor
     * @return String
     */
    public String getCummulativeTexisTime() {
        return (cummulativeTexisTime / 1000.0) + " seconds";
    }

    /**
     * Stats accessor
     * @return int
     */
    public int getOracleCacheHits() {
        return oracleCacheHits;
    }

    /**
     * Stats accessor
     * @return boolean
     */
    public boolean getOracleEnabled() {
        return oracleEnabled;
    }

    /**
     * Accessor for relatedCacheConnectionPool
     * @return atg.service.resourcepool.JDBCConnectionPool
     */
    public JDBCConnectionPool getRelatedCacheConnectionPool() {
        return relatedCacheConnectionPool;
    }

    /**
     * Stats accessor
     * @return int
     */
    public int getRequestCount() {
        return requestCount;
    }

    /**
     * Stats accessor
     * @return int
     */
    public int getSlowTexisRequestCount() {
        return slowTexisRequestCount;
    }

    /**
     * Stats accessor
     * @return boolean
     */
    public boolean getSystemEnabled() {
        return systemEnabled;
    }

    /**
     * Stats accessor
     * @return boolean
     */
    public boolean getTexisEnabled() {
        return texisEnabled;
    }

    /**
     * stats accessor
     * @return int
     */
    public int getTexisRequests() {
        return texisRequests;
    }

    /**
     * config param accessor
     * @return int
     */
    public int getTexisTimeoutMillis() {
        return texisTimeoutMillis;
    }

    /**
     * This method was created in VisualAge.
     * @return com.go2.search.related.TexisUDPConnectionPool
     */
    public TexisUDPConnectionPool getTexisUDPConnectionPool() {
        return texisUDPConnectionPool;
    }

    /**
     * mutator
     * @param newValue boolean
     */
    public void setOracleEnabled(boolean newValue) {
        this.oracleEnabled = newValue;
    }

    /**
     * Mutator for relatedCacheConnectionPool
     * @param newValue atg.service.resourcepool.JDBCConnectionPool
     */
    public void setRelatedCacheConnectionPool(JDBCConnectionPool
            newValue) {
        this.relatedCacheConnectionPool = newValue;
    }

    /**
     * mutator
     * @param newValue boolean
     */
    public void setSystemEnabled(boolean newValue) {
        this.systemEnabled = newValue;
    }

    /**
     * mutator
     * @param newValue boolean
     */
    public void setTexisEnabled(boolean newValue) {
        this.texisEnabled = newValue;
    }

    /**
     * parameter mutator
     * @param newValue int
     */
    public void setTexisTimeoutMillis(int newValue) {
        this.texisTimeoutMillis = newValue;
    }

    /**
     * This method was created in VisualAge.
     * @param newValue com.go2.search.related.TexisUDPConnectionPool
     */
    public void setTexisUDPConnectionPool(TexisUDPConnectionPool
            newValue) {
        this.texisUDPConnectionPool = newValue;
    }
}
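The reply-matching loop in findFromUDPTexis (discard replies carrying a stale sequence number, keep waiting only for whatever remains of the timeout budget) can be exercised without sockets. The sketch below is a simulation under stated assumptions: the Reply type and the awaitReply method are hypothetical, and each reply carries a pre-recorded arrival time instead of being read from a DatagramSocket.

```java
import java.util.List;

// Hypothetical, socket-free simulation of sequence-number reply matching.
final class ReplyMatcher {

    /** Hypothetical inbound reply with a recorded arrival time. */
    static final class Reply {
        final int sequenceNumber;
        final long arrivalMillis;
        final String payload;

        Reply(int sequenceNumber, long arrivalMillis, String payload) {
            this.sequenceNumber = sequenceNumber;
            this.arrivalMillis = arrivalMillis;
            this.payload = payload;
        }
    }

    /** Time left in the budget at the given instant; zero or negative means give up. */
    static long remaining(long startMillis, long nowMillis, long timeoutMillis) {
        return timeoutMillis - (nowMillis - startMillis);
    }

    /**
     * Returns the payload of the first reply that matches expectedSeq and
     * arrives inside the timeout budget, or null when the budget runs out.
     */
    static String awaitReply(List<Reply> inbound, int expectedSeq,
                             long startMillis, long timeoutMillis) {
        for (Reply r : inbound) {
            // a reply arriving after the budget is spent would never be read:
            // the blocking receive would have timed out first
            if (remaining(startMillis, r.arrivalMillis, timeoutMillis) <= 0) {
                return null;
            }
            if (r.sequenceNumber == expectedSeq) {
                return r.payload;
            }
            // stale reply from an earlier request: ignore it and keep waiting
        }
        return null;
    }
}
```

Tagging each request with a sequence number lets a pooled UDP socket be reused safely: a late answer to an earlier, abandoned request is recognized and dropped instead of being returned for the current query.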






The following code is control code that controls the
dumping of data from the live
listing database, loading the crawled text, and
inverted-indexing for
all the related search indexing, including building the
'derived-data' elements:



#!/bin/ksh -x

export PATH=/usr/local/morph3/bin:.:$PATH
#. ./.zshrc

export TMP=/export/home/goto/tmp
export TEMP=$TMP
export TEMPDIR=$TMP
export TMPDIR=$TMP

TMPTABLE=line_ad0
TMPTABLE2=line_ad
TERMSTABLE=terms

INC=02

NEWTABLE=line_ad${INC}

CRAWLDATA=/home/goto/rs/DONE/ALL.UNIQ
CRAWLTABLE=line_ad4
SPOOL=/home/goto/list

DB=/home/goto/crawldb

###############
Log() {
    echo '\n####' $(date "+%m/%d %H:%M:%S"): "${*}" '####'
}

Log 0. timport crawled data

Log 0.1 Create line_ad4 and unique index in preparation for 'crawl' import

tsql -d $DB <<!
drop table line_ad4;
create table line_ad4 (
id counter,
ad_url varchar(300),
crawltitle varchar(750),

crawlmeta varchar ( 5 

crawlbody varchar(8000)
);

drop index idx4ad_url;
create unique index idx4ad_url on line_ad4(ad_url);
!

timport -database $DB -table $CRAWLTABLE -s /home/goto/rs/DONE/crawl.sch -file $CRAWLDATA
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Log 1. extract iine_ads from live_ADMN into column delimited spool 
file 

umpadia $ SPOOL 

Log 2. timport -database $DB -table $TMPTABLE -s newrs.sch -file ${SPOOL}

timport -database $DB -table $TMPTABLE -s newrs.sch -file ${SPOOL}

Log 3. build canon index on $TMPTABLE
##
tsql -d ${DB} <<!
drop index idx0cst;
create index idx0cst on ${TMPTABLE}(cannon_search_text);
!

##
Log 4 add counts of canons to $TMPTABLE
texis DB=$DB TMPTABLE=$TMPTABLE updatecnt

Log 5. build url index on $TMPTABLE
tsql -d ${DB} <<!
drop index idx0url;
create index idx0url on ${TMPTABLE}(ad_url);
!

Log 6. merge crawled text w/ original
tsql -d ${DB} <<!
drop index idx4url;
create index idx4url on ${CRAWLTABLE}(ad_url);
drop table $TMPTABLE2;
CREATE TABLE $TMPTABLE2
AS
SELECT
a.price price,
a.rating rating,
a.ad_id ad_id,
a.bid_date bid_date,
a.raw_search_text raw_search_text,
a.cannon_search_text cannon_search_text,
a.ad_spec_title ad_spec_title,
a.ad_spec_desc ad_spec_desc,
a.ad_url ad_url,
a.resource_id resource_id,
b.crawltitle crawltitle,
b.crawlmeta crawlmeta,
b.crawlbody crawlbody,
a.canon_cnt
from $TMPTABLE a, $CRAWLTABLE b
where a.ad_url = b.ad_url
order by price desc;
!

fftexis DB=$DB C RAV v LT AB L E = $ CPAWLTABLE TMPTA3L E = $ TMPTABL E update! L 

Log 7. collapse 0 onto 01 
Log 7.1 first -make the table 






tsql -d ${DB} <<!
drop table ${NEWTABLE};



U t> LJ JL ~\JL Z> \ LJO J : 

create table ${NEWTABLE} (
canon_cnt integer,
cannon_search_text varchar(50),
raw_search_text varchar(50),
advertiser_ids varchar(4096),
advertiser_cnt integer,
words varchar(65536)
);
!



Log 7.2 second, build the uniq sorted list of search terms
Log 7.2 select cannon_search_text from ${TMPTABLE} ;
tsql -d ${DB} <<! | sort -u | timport -s termstable.sch -file -
select cannon_search_text from ${TMPTABLE};
!

Log 7.3 third, build uniq index on terms table
Log 7.3 create unique index idxterm on terms term
tsql -d ${DB} <<!
create unique index idxterm on terms(term);
!

Log 7.3.9 prepare for collapse
Log 7.3.9 create index idxcstcol on $TMPTABLE2 cannon_search_text
Log 7.3.9 create index idxadidcol on $TMPTABLE2 ad_id
tsql -d ${DB} <<!
create index idxcstcol on $TMPTABLE2(cannon_search_text);
create index idxadidcol on $TMPTABLE2(ad_id);
!





Log 7.4 fourth collapse around csts
Log 7.4 texis SRCTABLE=$TMPTABLE2 TGTTABLE=$NEWTABLE TERMSTABLE=$TERMSTABLE collapse

texis SRCTABLE=$TMPTABLE2 TGTTABLE=$NEWTABLE TERMSTABLE=$TERMSTABLE collapse



Log 8 do the porn line

buildporn | timport -database $DB -table $NEWTABLE -s rsporn.sch -file -

newporn | timport -database $DB -s rsnewporn.sch -file -

Log 10 metamorph index words column
tsql -d $DB <<!

create metamorph inverted index mmx${INC}w on ${NEWTABLE} (words);
!

Log 10 all done



Dumps bid-for-placement search listings data

#!/bin/ksh

TXSORAUSER=XXXXXXX

TXSORAPWD=XXXXXXX

T 1T<JV3\ rDT»Tn-vvvvvv 

J-l V Ui\ V J- VV J-/ — iUl^li^Wl 

#SPOOL=pipe
SPOOL=${1}

SERVER=XXXXXX

Log() {
    echo '####' $(date "+%m/%d %H:%M:%S") : "${1}" '####'
}

Log "start dump"

sqlplus -S ${TXSORAUSER}/${TXSORAPWD}@${SERVER} >/dev/null <<!
set heading off
set linesize 750
set pagesize 0
set arraysize 1
set maxdata 50000
set buffer 50000
set crt off;
set termout off
spool ${SPOOL}
select
rpad(to_char(advertiser_id),8) ||
rpad(raw_search,30) ||
rpad(canon_search,30) ||
rpad(title,100) ||
rpad(description,280) ||
rpad(url,200) ||
rpad(resource_id,20) ||
rpad(to_char(price*100),5) ||
rpad(rating,2) ||

j-uau \ uu uixaa. v ocai <^ix xu; fO) J | 

rpad(resource_id,18) ||
rpad(to_char(line_ad_id),8) ||
to_char(bid_date,'YYYYMMDD HHMMSS')
from ads
where status = 5 and rating = 'G' and canon_search <> 'grab bag'
and rownum < 10000;
spool off;
quit;

!

Log "end dump"

exit 0
Counts # of occurrences of pagehids for each
particular potential related search result

<script language=vortex>
<timeout = -1></timeout>
<a name=main>

<SQL ROW "select distinct cannon_search_text cst from " $TMPTABLE>

<SQL ROW "select count(*) cnt from " $TMPTABLE " where
cannon_search_text = $cst">

<SQL NOVARS "update " $TMPTABLE " set canon_cnt = $cnt where
cannon_search_text = $cst">
</SQL>
</SQL>
</SQL>
</a>

</script>
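
The Vortex script above issues one count query and one update per distinct canonical search term. As a rough illustration only, the same per-term count can be computed in a single pass; this is a Python sketch, not part of the filed appendix, and the in-memory `rows` list and the helper name `count_listings_per_term` are assumptions made for the example.

```python
from collections import Counter

def count_listings_per_term(rows):
    """Return a mapping of canonical search term to number of rows (hypothetical helper)."""
    return Counter(row["cannon_search_text"] for row in rows)

# Illustrative rows standing in for $TMPTABLE; the values are made up.
rows = [
    {"cannon_search_text": "used car", "ad_id": 1},
    {"cannon_search_text": "used car", "ad_id": 2},
    {"cannon_search_text": "flower", "ad_id": 3},
]

counts = count_listings_per_term(rows)

# Equivalent of the "update ... set canon_cnt = $cnt" statement.
for row in rows:
    row["canon_cnt"] = counts[row["cannon_search_text"]]
```

Counting once and then writing the result back avoids the nested query per term, at the cost of holding the term counts in memory.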
Aggregates web page body-text and

result, while collecting and creating
derived-data of 1, how many different
advertisers have web-pages associated
with the related-search result.

<script language=vortex>
<timeout = -1></timeout>
<DB = /home/goto/crawldb>
<a name=main>

<!-- get all canon-terms from tmp table -->
<SQL ROW "select term cst from " $TERMSTABLE>
<$words = >
<$rsts = >
<$csts = >
<$asts = >
<$asds = >
<$cts = >
<$cms = >
<$cbs = >
<$advs = >
<$last_adv = >
<$adv_cnt = 0>

<!-- get all rows w/ this canon term from tmp table -->
<SQL ROW
"select canon_cnt cc, ad_id aid, raw_search_text rst,
cannon_search_text cst, ad_spec_title ast,
ad_spec_desc asd, crawltitle ct, crawlbody cb, crawlmeta cm
from " $SRCTABLE " where cannon_search_text = $cst order by
ad_id">

<!-- aggregate the text to prepare for collapsed insert -->
<$rsts = ( $rsts + ' ' + $rst )>
<$csts = ( $csts + ' ' + $cst )>
<$asts = ( $asts + ' ' + $ast )>
<$asds = ( $asds + ' ' + $asd )>
<$cts = ( $cts + ' ' + $ct )>
<$cms = ( $cms + ' ' + $cm )>
<$cbs = ( $cbs + ' ' + $cb )>

<if $aid != $last_adv>
<!-- add advertiser to list if not seen him before -->
<$advs = ( $advs + ' ' + $aid )>
<$adv_cnt = ( $adv_cnt + 1 )>
<$last_adv = $aid>
</if>
</SQL>

< & c~ Pi 'n <~\rt r^-n r- — iT Anns 

<$words = ( $rsts + ' ' + $csts + ' ' + $asts + ' ' + $asds + ' ' +
$cms + ' ' + $cbs + ' ' + $cts )>

<!-- pick off zeroeth element only from $rst array -->
<loop $rst>
<$Rst = $rst>
<break>
</loop>

<strlen $words>
<$wlen = $ret>
<strlen $advs>

<!-- display which row we're working on -->
$wlen $ret $cst

<!-- insert to collapsed row -->
<SQL NOVARS "insert into " $TGTTABLE " (canon_cnt,
cannon_search_text, raw_search_text, advertiser_ids, advertisement,
words) VALUES ($cc, $cst, $Rst, $advs, $adv_cnt, $words)"></SQL>

<!-- $words ***<$ret = (text2mm($words, 50))> $ret -->
</SQL>
</a>

</script>
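
The collapse step above folds every row that shares a canonical search term into one record, concatenating the text columns and counting how many distinct ids appear. The following Python sketch illustrates that idea under stated assumptions: rows are plain dictionaries, the function name `collapse` and the output key `distinct_id_count` are hypothetical, and whitespace handling is simplified relative to the Vortex script.

```python
from itertools import groupby
from operator import itemgetter

# Column order follows the $words concatenation in the script above.
TEXT_FIELDS = ("raw_search_text", "cannon_search_text", "ad_spec_title",
               "ad_spec_desc", "crawlmeta", "crawlbody", "crawltitle")

def collapse(rows):
    """Collapse listing rows into one record per canonical search term."""
    by_term = itemgetter("cannon_search_text")
    collapsed = []
    for term, group in groupby(sorted(rows, key=by_term), key=by_term):
        group = sorted(group, key=itemgetter("ad_id"))
        ids = []
        for row in group:
            # Rows are ordered by ad_id, so a repeated id is always adjacent.
            if not ids or ids[-1] != row["ad_id"]:
                ids.append(row["ad_id"])
        # Concatenate field by field, each field across all rows of the group.
        words = " ".join(row.get(field, "") for field in TEXT_FIELDS for row in group)
        collapsed.append({
            "canon_cnt": group[0]["canon_cnt"],
            "cannon_search_text": term,
            "raw_search_text": group[0]["raw_search_text"],
            "advertiser_ids": " ".join(str(i) for i in ids),
            "distinct_id_count": len(ids),  # hypothetical name for $adv_cnt
            "words": words,
        })
    return collapsed

# Illustrative input; the values are made up.
rows = [
    {"cannon_search_text": "used car", "raw_search_text": "used cars",
     "ad_id": 9, "canon_cnt": 3, "crawltitle": "Cars"},
    {"cannon_search_text": "used car", "raw_search_text": "used car",
     "ad_id": 7, "canon_cnt": 3, "crawltitle": "Autos"},
    {"cannon_search_text": "used car", "raw_search_text": "used car",
     "ad_id": 7, "canon_cnt": 3, "crawltitle": "Autos"},
]
result = collapse(rows)
```

Because the rows are sorted by id before the scan, comparing each id with the previous one is enough to detect repeats, which is the same trick the script uses with `$last_adv`.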
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Database schema layout used to upload 
bidded search listings 



database /home/goto/crawldb
# droptable line_ad1
droptable line_ad0

♦.^kU 14-^^ ~, ^ A 

L.ajJJ.C JLJ.Alg o,v^v 

create table
col
#keepfirst
trimspace
datefmt yyyymmdd HHMMSS

#        Name                 Type            Tag          default_val
field    advertiser_id        varchar(8)      1-8          ""
field    raw_search_text      varchar(40)     9-48         ""
field    cannon_search_text   varchar(40)     49-88        ""
field    ad_spec_title        varchar(100)    89-188       ""
field    ad_spec_desc         varchar(2000)   189-2188     ""
field    ad_url               varchar(200)    2189-2388    ""
field    resource_id          varchar(20)     2389-2408    ""
field    price                integer         2409-2413    0
field    rating               char(2)         2414-2415    ""
field    ad_id                integer         2416-2423    0
field    bid_date             date            2424-2438    0
field    canon_cnt            integer
field    crawlwords           varchar(40)
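
The Tag column of the schema gives 1-based, inclusive character positions for each field of a fixed-width record. As an illustration of how such a layout can be read, here is a short Python sketch; it is not the Texis importer, the helper name `parse_record` is hypothetical, and only the first four fields are listed to keep the example small.

```python
# (name, first column, last column), 1-based and inclusive, copied from the Tag column.
FIELDS = [
    ("advertiser_id", 1, 8),
    ("raw_search_text", 9, 48),
    ("cannon_search_text", 49, 88),
    ("ad_spec_title", 89, 188),
]

def parse_record(line, fields=FIELDS):
    """Slice one fixed-width line into a dict, trimming padding like 'trimspace'."""
    return {name: line[start - 1:end].strip() for name, start, end in fields}

# Illustrative record padded to the widths above; the values are made up.
sample = ("1234".ljust(8) + "Used Cars".ljust(40)
          + "used car".ljust(40) + "Great deals on used cars".ljust(100))
record = parse_record(sample)
```

Converting the 1-based inclusive range to a Python slice means subtracting one from the start and leaving the end unchanged.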
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Manual join of search listing data with crawled 
web page data into a single merged table

<script language=vortex>
<timeout = -1></timeout>
<a name=main>
<DB = "/home/goto/crawldb">

<SQL ROW "select ad_url myurl, crawltitle ct, crawlmeta cm, crawlbody
cb from " $CRAWLTABLE>

<SQL NOVARS "update " $TMPTABLE " set crawltitle = $ct, crawlmeta =
$cm, crawlbody = $cb where ad_url = $myurl">
</SQL>
</SQL>

</a>

</script>
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Code to duplicate URL Crawl elimination 



/**
 * Insert the type's description here.
 * Creation date: (02/18/2000 11:12:12 AM)
 * @author:
 */

import java.io.*; //import of java classes needed for input/output
import java.util.*;
//import corejava.*;
import java.lang.String;
public class Url {
    /**
     * Compare URLs Address
     */
    public static void main(String args[]) throws Exception
    {
        // Declarations of the input and output File
        BufferedReader inputFile;
        PrintWriter nonDupFile;
        PrintWriter dupFile;

        // Initialization
        String firstUrl="";
        String secondUrl="";
        String urlBufferA, urlBufferB="", urlBufferC="";
        String compareDomainA="";
        String compareDomainB="";
        String compareDomainC="";
        String newFlag="true";

        inputFile = new BufferedReader(new
            FileReader("/home/lauw/urls.lau"));

        // Create (truncate) both output files; Compare reopens them in append mode
        nonDupFile = new PrintWriter(new
            FileWriter("/home/lauw/nonDupFile.real"));
        dupFile = new PrintWriter(new
            FileWriter("/home/lauw/dupFile.real"));
        nonDupFile.close();
        dupFile.close();

        firstUrl=inputFile.readLine();
        secondUrl=inputFile.readLine();
        urlBufferC=inputFile.readLine();

        urlBufferA=firstUrl;
        urlBufferB=secondUrl;

        do
        {
            Slash ccompareDomainA = new Slash();
            Slash ccompareDomainB = new Slash();
            Slash ccompareDomainC = new Slash();

            compareDomainA = ccompareDomainA.Slash(urlBufferA);
            compareDomainB = ccompareDomainB.Slash(urlBufferB);
            compareDomainC = ccompareDomainC.Slash(urlBufferC);

            Compare compareSub = new Compare();
            newFlag=compareSub.Compare(compareDomainA, compareDomainB,
                compareDomainC, urlBufferB, newFlag);
            urlBufferA=urlBufferB;
            urlBufferB=urlBufferC;
            urlBufferC=inputFile.readLine();
        } while (urlBufferC != null);

        ///////////////////////////Loop for first Null value
        urlBufferC=firstUrl;

        Slash ccompareFirstNullDomainA = new Slash();
        Slash ccompareFirstNullDomainB = new Slash();
        Slash ccompareFirstNullDomainC = new Slash();

        compareDomainA = ccompareFirstNullDomainA.Slash(urlBufferA);
        compareDomainB = ccompareFirstNullDomainB.Slash(urlBufferB);
        compareDomainC = ccompareFirstNullDomainC.Slash(urlBufferC);

        Compare compareFirstNullSub = new Compare();
        newFlag=compareFirstNullSub.Compare(compareDomainA,
            compareDomainB, compareDomainC, urlBufferB, newFlag);

        /////////////////////////Loop for last Null value
        urlBufferA=urlBufferB;
        urlBufferB=firstUrl;
        urlBufferC=secondUrl;

        Slash ccompareLastNullDomainA = new Slash();
        Slash ccompareLastNullDomainB = new Slash();
        Slash ccompareLastNullDomainC = new Slash();

        compareDomainA = ccompareLastNullDomainA.Slash(urlBufferA);
        compareDomainB = ccompareLastNullDomainB.Slash(urlBufferB);
        compareDomainC = ccompareLastNullDomainC.Slash(urlBufferC);

        Compare compareLastNullSub = new Compare();
        newFlag=compareLastNullSub.Compare(compareDomainA,
            compareDomainB, compareDomainC, urlBufferB, newFlag);

        inputFile.close();
    }
}
class Slash {
    String Slash(String buffer)
    {
        int domainSlashEnd = 0;
        int domainSlashStart = 0;
        boolean domainIndex = false;
        boolean startFound = false;
        boolean newFlag = false;
        String comparedomain;
        comparedomain="";

        for (int domainSlashLoop=8; domainSlashLoop <= (buffer.length()-1);
             domainSlashLoop++)
        {
            if ((buffer.substring(domainSlashLoop, (domainSlashLoop+1)).equals("/"))
                || (buffer.substring(domainSlashLoop, (domainSlashLoop+1)).equals("?")))
            {
                if (startFound==false)
                {
                    /// Check the Urls with Domain Name only
                    if ((domainSlashLoop+1)==buffer.length())
                    {
                        comparedomain=buffer.substring(0, (buffer.length()));
                        domainIndex=true;
                        domainSlashLoop=buffer.length()+500;
                    }//end for domain name only
                    domainSlashStart = domainSlashLoop+1;
                    startFound = true;
                }
                else
                {
                    domainSlashEnd=domainSlashLoop;
                    domainSlashLoop = buffer.length()+500; /// add 500 to get out of the loop
                }
            }// end of the "/" or "?" check
            if (domainSlashEnd==0)
            {
                domainSlashEnd = buffer.length();
            }
            if (domainIndex==false)
            {
                comparedomain=buffer.substring(domainSlashStart,
                    domainSlashEnd);
            }
        }// end for Loop

        return comparedomain;
    }
}
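
The `Slash` class scans a URL from character index 8 and returns the text between the first and second `/` or `?` it meets, which is the root path component used later for comparison. The Python sketch below loosely mirrors that scan for illustration; the function name `root_path_component` and the `skip` parameter are assumptions, and very short inputs are handled more simply than in the Java code.

```python
def root_path_component(url, skip=8):
    """Return the text between the first and second '/' or '?' found at or after `skip`."""
    start = None
    for i in range(skip, len(url)):
        if url[i] in "/?":
            if start is None:
                if i + 1 == len(url):
                    # Separator is the last character: treat as a domain-only URL.
                    return url
                start = i + 1
            else:
                return url[start:i]
    # Zero or one separator found: return the rest of the URL (or all of it).
    return url[start:] if start is not None else url

component = root_path_component("http://www.example.com/cars/used.html")
query_component = root_path_component("http://www.example.com/search?q=x")
domain_only = root_path_component("http://www.example.com/")
```

Starting the scan at index 8 skips the `//` of an `http://` prefix, so the first separator found is the one that ends the host name.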
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import java.io.*; //import of java classes needed for input /output 

class Compare
{
    // The method name matches the calls made from Url.main
    String Compare(String aCompareDomainA, String aCompareDomainB,
        String aCompareDomainC, String aUrlBufferB, String newFlag) throws
        Exception
    {
        PrintWriter nonDupFile;
        PrintWriter dupFile;

        // Open both files in append mode with auto-flush
        nonDupFile = new PrintWriter(new
            FileWriter("/home/lauw/nonDupFile.real", true), true);
        dupFile = new PrintWriter(new
            FileWriter("/home/lauw/dupFile.real", true), true);

        if (aCompareDomainC.equals(aCompareDomainB))
        {
            if (newFlag.equals("true"))
            {
                dupFile.println("New");
                newFlag = "false";
            }
            System.out.println("Duplicate");
            dupFile.println(aUrlBufferB);
        }
        else
        {
            if (aCompareDomainB.equals(aCompareDomainA))
            {
                if (newFlag.equals("true"))
                {
                    dupFile.println("New");
                    newFlag = "false";
                }
                System.out.println("print a Duplicat in second time");
                System.out.println("Sec Duplicate");
                dupFile.println(aUrlBufferB);
            }
            else
            {
                System.out.println("non Dup");
                nonDupFile.println(aUrlBufferB);
                newFlag="true";
            }
            System.out.println("**************************");
        }

        // Release the file handles opened for this call
        nonDupFile.close();
        dupFile.close();

        return newFlag;
    }
}
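
Taken together, `Url` and `Compare` walk a sorted URL list three entries at a time and mark the middle entry as a duplicate when its comparison key equals that of either neighbour, wrapping around at both ends of the list. A compact Python sketch of that neighbour test follows; the names `split_duplicates` and `host_and_first_segment` are hypothetical, and keying on host plus first path segment is an assumption of this example (the Java code compares only the extracted component).

```python
from urllib.parse import urlsplit

def host_and_first_segment(url):
    """Hypothetical comparison key: (host, first path segment)."""
    parts = urlsplit(url)
    return parts.netloc, parts.path.strip("/").split("/")[0]

def split_duplicates(urls, key=host_and_first_segment):
    """Classify each URL by comparing its key with both neighbours' keys."""
    keys = [key(u) for u in urls]
    dups, non_dups = [], []
    n = len(urls)
    for i, url in enumerate(urls):
        prev_key = keys[i - 1]          # index -1 wraps to the last entry
        next_key = keys[(i + 1) % n]    # wraps to the first entry
        if n > 1 and (keys[i] == prev_key or keys[i] == next_key):
            dups.append(url)
        else:
            non_dups.append(url)
    return dups, non_dups

# Illustrative, already-sorted input; the URLs are made up.
urls = [
    "http://a.example/cars/1",
    "http://a.example/cars/2",
    "http://a.example/boats/1",
    "http://b.example/cars/1",
]
dups, non_dups = split_duplicates(urls)
```

The neighbour test only finds duplicates that sit next to each other, so the input is assumed to be sorted in a way that places URLs with equal keys side by side.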
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CLAIMS 

1. A method of generating a search result list, the method comprising:

receiving a search request from a searcher; in a pay for performance
database including a plurality of search listings, identifying search listings
generating a match with the search request;

in a related search database including related search listings
generated from the pay for performance database, identifying related search
listings relevant to the search request; and

returning a search result list to the searcher including the identified
search listings and one or more of the identified related search listings.

2. The method of claim 1 wherein identifying related search listings
comprises:

searching an inverted index of the pay for performance database;
and

searching an index of meta-information based on the pay for
performance database.

3. The method of claim 1 further comprising:

sorting the identified related search listings by relevancy to the
search request;

L>VlVVUIlt CI AAAAAA^VJ- 11UI11U tvl Ul Lilt AVA\_^A1 LIJLIOU A ^ A CALV-/VA JVyCUUll 

listings as most relevant related search listings; and

returning the most relevant related search listings in the search
result list.

4. The method of claim 3 wherein sorting comprises:

selecting the identified related search listings according to
frequency of occurrence of a queried term from the search request in the related
search listings.

5. The method of claim 3 wherein sorting comprises:

selecting the identified related search listings according to proximity
of one or more queried terms from the search request in the related search listings.

6. The method of claim 3 wherein sorting comprises:

weighting the related search listings according to predetermined
weighting criteria; and

selecting the identified related search listings according to the
weighting of the related search listings.

7. The method of claim 6 wherein weighting the related search listings
comprises:

increasing relative weighting of a related search listing which
includes one or more bidded search terms identified by an advertiser.

8. The method of claim 6 wherein weighting the related search listings
comprises:

increasing relative weighting of a related search listing which is
contained in a description of a search listing identified by an advertiser.

9. The method of claim 6 wherein weighting the related search listings
comprises:

increasing relative weighting of a related search listing which is
contained in a title of a search listing identified by an advertiser.

10. The method of claim 6 wherein weighting the related search listings
comprises:

increasing relative weighting of a related search listing which is
contained in metatag keywords of a web page maintained by an advertiser.

11. The method of claim 6 wherein weighting the related search listings
comprises:

increasing relative weighting of a related search listing which is
contained in text data of a web page maintained by an advertiser.
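
By way of illustration only, the field-dependent weighting recited in claims 6 through 11 might be sketched as follows in Python. The field names, the numeric weights, and the function name `weighted_score` are assumptions chosen for the example; the claims do not specify any particular values or scoring formula.

```python
# Hypothetical weights: a match in a bidded term counts more than a match in body text.
FIELD_WEIGHTS = {
    "bidded_terms": 5.0,
    "title": 3.0,
    "description": 2.0,
    "meta_keywords": 1.5,
    "body": 1.0,
}

def weighted_score(query, listing, weights=FIELD_WEIGHTS):
    """Sum, over fields, the field weight times the number of whole-word query matches."""
    terms = query.lower().split()
    score = 0.0
    for field, weight in weights.items():
        words = listing.get(field, "").lower().split()
        score += weight * sum(words.count(term) for term in terms)
    return score

# Illustrative related search listings; the contents are made up.
listing_a = {"bidded_terms": "used cars", "title": "cheap used cars"}
listing_b = {"body": "we sell cars"}

ranked = sorted([listing_b, listing_a],
                key=lambda listing: weighted_score("cars", listing),
                reverse=True)
```

Sorting by the resulting score puts listings whose match comes from advertiser-supplied fields ahead of listings matched only through crawled page text.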
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12. The method of claim 3 wherein sorting comprises: 

ranking the related search listings according to spread of the related
search listings; and

selecting the identified related search listings according to the
ranking of the related search listings.

13. The method of claim 12 wherein ranking comprises: 
identifying key information contained in the related search listings; 

and 

increasing ranking of a related search listing according to presence 
of the key information in the related search listing. 

14. The method of claim 13 wherein identifying key information 
comprises: 

detecting fielded advertiser data in the related search listing; and 
detecting crawled data in the related search listing. 

15. A system comprising: 

a pay for performance database; 

a related search database formed at least in part using the pay for 
performance database; and 

a server coupled with the pay for performance database and the 
related search database, the server operative to select a first set of search results 
from the pay for performance database and a second set of search results from the 
related search database in response to a search request from a searcher. 

16. The system of claim 15 wherein the pay for performance database
comprises:

a plurality of search listings, each search listing including
a search term,
a bid amount, and
a Uniform Resource Locator corresponding to an address of a
document on a network server remote from the system.

17. The system of claim 16 wherein the related search database
comprises:

a plurality of related search listings, each related search listing
including

a keyword associated with one document of the pay for performance
database, and

text of the one document.

18. The system of claim 17 wherein each search listing of the plurality
of search listings further comprises:

descriptive text describing the one document,
a title, and

metatags associated with the document.

19. The system of claim 18 wherein each search listing comprises:

the descriptive text associated with the one document;
the title associated with the one document; and
the metatags associated with the one document.

20. A method for forming a related searches database for identifying
related searches in response to a search request to a pay for performance database
including a plurality of search listings, the method comprising:

referenced by a search listing of the pay for performance database;

creating an inverted index for the related search database entries;
and

creating an index for key information associated with each search
listing of the pay for performance database.

21. The method of claim 20 wherein storing comprises:

identifying similar web pages responsive to root path components
and query arguments of Uniform Resource Locators for two or more web pages
referenced by search listings of the pay for performance database;

rejecting for storage similar web pages.

22. The method of claim 21 wherein identifying similar web pages
comprises:

identifying first key words of a first web page;
identifying second key words of a second web page; and
comparing the first key words and the second key words;
when the first key words and the second key words have a
predetermined relationship, identifying the first web page and the second web
page as similar web pages.

23. A method for searching data in a database including internet data
from internet web sites, the method comprising:

forming a list of uniform resource locators (URLs) associated with internet
web sites to be accessed;
removing duplicate URLs from the list;

if a URL on the list is similar to another URL on the list, crawling a
predetermined number of potentially duplicate URLs;

comparing bodies of the URL on the list and the potentially duplicate
URLs;

if the body of the URL on the list is similar to the body of the potentially
duplicate URL,

suspending crawling of the potentially duplicate URLs, and

storing the body of the URL on the list in the database for
subsequent search.
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24. The method of claim 23 further comprising: 
comparing a selected URL with other URLs on the list; and
determining the URL is similar to the other URL on the list when the URL
has a predetermined text portion in common with the other URL on
the list.

25. The method of claim 23 wherein comparing bodies of the URL on
the list and the potentially duplicate URLs comprises:

comparing text from the URL on the list and text from one potentially
duplicate URL; and

determining the URL on the list is similar to the one potentially duplicate
URL when the text from the URL on the list and the text from the
one potentially duplicate URL have a predetermined text portion in
common.
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